Data exploration
===========================

In this tutorial, we will analyze a dataset related to diabetes using the crandas library. We will import the necessary libraries, load the datasets, and then perform some data analysis tasks. 

First, let's import the required libraries:

.. code-block:: python

    import crandas as cd
    import pandas as pd
    import math
    from tabulate import tabulate
    import matplotlib.pyplot as plt


    # Set the colours for our MatPlotLib charts
    plt.rcParams['axes.prop_cycle'] = plt.cycler('color', ['#ec297b', '#4f89d6', '#03186b', '#343434', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf'])
    
    # On a jupyter environment provided by RosemanLabs, session variables are set automatically in the background
    # Set the session base_path and session.endpoint manually when executing this notebook in any other environment
    # Uncomment the following 4 commented lines and change their value to set these session variables

    # from crandas.base import session
    # from pathlib import Path

    # session.base_path = Path('base/path/to/vdl/secrets') 
    # session.endpoint = 'https://localhost:9820/api/v1'

Step 1: Accessing the Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, let's load the datasets. There are three CSV files that we will be working with. We assume that each of them contains medical data about patients with diabetes, provided by different hospitals.

.. note::
    In a real world scenario, each hospital would upload their own data and here we would simply retrieve it from the VDL. In this example, we simply upload all three databases ourselves.

.. code-block:: python

    file1 = "data/diabetes_data/diabetes_dummy_1.csv"
    file2 = "data/diabetes_data/diabetes_dummy_2.csv"
    file3 = "data/diabetes_data/diabetes_dummy_3.csv"

    dataset_1 = cd.read_csv(file1, name="dataset_1")
    dataset_2 = cd.read_csv(file2, name="dataset_2")
    dataset_3 = cd.read_csv(file3, name="dataset_3")

Let's check the structure of the first dataset:


>>> print(repr(dataset_1))
    Name: 753CEC55C6CDD2E2B6D98455E0C35F6D57837B103252E864853403FAF3A47728
    Size: 100 rows x 18 columns
    CIndex([Col("Patientnr", "i", 1), Col("M/V", "i", 1), Col("Leeftijd", "i", 1), Col("patient_sims", "i", 1), Col("zorgprofiel", "i", 1), Col("Hoofdbehandelaar", "i", 1), Col("Podotherapeut huisbezoek", "i", 1), Col("Pedicure huisbezoek", "i", 1), Col("DM Type", "i", 1), Col("Tekenen van infectie", "i", 1), Col("Ulcus/amputatie", "i", 1), Col("Inactieve charcot-voet", "i", 1), Col("Nierfalen/Dialyse ", "i", 1), Col("Pulsaties links ", "i", 1), Col("Pulsaties rechts", "i", 1), Col("Doppler links", "i", 1), Col("Doppler rechts", "i", 1), Col("Huidig schoeisel", "i", 1)])


Step 2: Exploring the Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, we want to find how much of each dataset contains information of Type 1 diabetes and how much of it contains information about Type 2. For this, we filter by type using ``"dataset[DM Type"]==1`` and sum the total number of entries using ``sum``. Then we divide that by the size of each dataset.

.. code-block:: python

    # Print the question
    print("What is the relative ratio between both types of diabetes in each dataset? \n")

    # Define the column headers for the table
    headers=["Dataset", "% DM Type 1", "% DM Type 2"]

    # Define the data to be displayed in the table
    table = [["Dataset 1",((sum(dataset_1["DM Type"]==1)/len(dataset_1))*100), ((sum(dataset_1["DM Type"]==2)/len(dataset_1))*100)],
            ["Dataset 2",((sum(dataset_2["DM Type"]==1)/len(dataset_2))*100), ((sum(dataset_2["DM Type"]==2)/len(dataset_2))*100)],
            ["Dataset 3",((sum(dataset_3["DM Type"]==1)/len(dataset_3))*100), ((sum(dataset_3["DM Type"]==2)/len(dataset_3))*100)]]

    # Format the table using the tabulate library and print it
    print(tabulate(table, headers=headers, tablefmt='psql'))


.. parsed-literal::

    Question 1: What is the relative ratio between both types of diabetes in each dataset?

        +-----------+---------------+---------------+
    | Dataset   |   % DM Type 1 |   % DM Type 2 |
    |-----------+---------------+---------------|
    | Dataset 1 |            50 |            50 |
    | Dataset 2 |            54 |            46 |
    | Dataset 3 |            47 |            53 |
    +-----------+---------------+---------------+
    
Different cases of the same type require different care profiles. Next we can create a pie chart to visualize the relative distribution of such care profiles within Type 1 Diabetes for dataset 1. We feed our open data to a plotting package to display it.

.. code:: python

    # Print the question to be answered
    print("\nQuestion 2: What is the relative distribution of care profiles within a certain type of diabetes? \n")

    # Specify the dataset being analyzed
    print("Dataset 1 - DM Type 1")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_1["DM Type"]==1) & (dataset_1["zorgprofiel"]==2)),
            sum((dataset_1["DM Type"]==1) & (dataset_1["zorgprofiel"]==3)),
            sum((dataset_1["DM Type"]==1) & (dataset_1["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart - only "explode" the 2nd slice 
    explode = (0, 0, 0) 
        
    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                            shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')  
        
    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 1 in Dataset 1") 
        
    # Display the pie chart
    plt.show()

.. parsed-literal::

    
    Question 2: What is the relative distribution of care profiles within a certain type of diabetes? 
    
    Dataset 1 - DM Type 1
    

.. image:: diabetes_statistics_files/diabetes_statistics_5_1.png

Now we do the same for Type 2 in dataset 1: 

.. code:: python

    
    # Specify the dataset being analyzed
    print("\nDataset 1 - DM Type 2")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_1["DM Type"]==2) & (dataset_1["zorgprofiel"]==2)),
            sum((dataset_1["DM Type"]==2) & (dataset_1["zorgprofiel"]==3)),
            sum((dataset_1["DM Type"]==2) & (dataset_1["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)  
        
    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')  

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 2 in Dataset 1") 

    # Display the pie chart
    plt.show()


.. parsed-literal::

    
    Dataset 1 - DM Type 2
    

.. image:: diabetes_statistics_files/diabetes_statistics_6_1.png

After this we move on to dataset 2 and do the same as above. First Type 1: 

.. code:: python

    
    # Specify the dataset being analyzed
    print("\nDataset 2 - DM Type 1")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_2["DM Type"]==1) & (dataset_2["zorgprofiel"]==2)),
            sum((dataset_2["DM Type"]==1) & (dataset_2["zorgprofiel"]==3)),
            sum((dataset_2["DM Type"]==1) & (dataset_2["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)  
        
    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')  

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 1 in Dataset 2") 

    # Display the pie chart
    plt.show()

.. parsed-literal::

    
    Dataset 2 - DM Type 1
    

.. image:: diabetes_statistics_files/diabetes_statistics_7_1.png

Now Type 2 for dataset 2: 

.. code:: python

    # Specify the dataset being analyzed
    print("\nDataset 2 - DM Type 2")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_2["DM Type"]==2) & (dataset_2["zorgprofiel"]==2)),
            sum((dataset_2["DM Type"]==2) & (dataset_2["zorgprofiel"]==3)),
            sum((dataset_2["DM Type"]==2) & (dataset_2["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)  
        
    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')  

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 2 in Dataset 2") 

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 2 - DM Type 2

.. image:: diabetes_statistics_files/diabetes_statistics_8_1.png

Next, DM Type 1 for dataset 3:

.. code:: python
    
    # Specify the dataset being analyzed
    print("\nDataset 3 - DM Type 1")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_3["DM Type"]==1) & (dataset_3["zorgprofiel"]==2)),
            sum((dataset_3["DM Type"]==1) & (dataset_3["zorgprofiel"]==3)),
            sum((dataset_3["DM Type"]==1) & (dataset_3["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)  
        
    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')  

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 1 in Dataset 3") 

    # Display the pie chart
    plt.show()


.. parsed-literal::

    Dataset 3 - DM Type 1

.. image:: diabetes_statistics_files/diabetes_statistics_9_1.png

Finally, for DM Type 2 in dataset 3:

.. code:: python

    # Specify the dataset being analyzed
    print("\nDataset 3 - DM Type 2")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_3["DM Type"]==2) & (dataset_3["zorgprofiel"]==2)),
            sum((dataset_3["DM Type"]==2) & (dataset_3["zorgprofiel"]==3)),
            sum((dataset_3["DM Type"]==2) & (dataset_3["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)  
        
    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')  

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 2 in Dataset 3") 

    # Display the pie chart
    plt.show()


.. parsed-literal::

    
    Dataset 3 - DM Type 2
    

.. image:: diabetes_statistics_files/diabetes_statistics_10_1.png


Step 3: Looking Deeper Into the Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the next part of this tutorial, we will analyze the datasets and get and see how knowing the shape of our data is fundamental for our analysis and the importance of correctly cleaning and labelling

Let's say we want to perform a similar as before, but focusing on the gender of the participants instead of diabetes type. First, we explore the ``M/V`` column (for the Dutch `Man/Vrouw` for `Man/Woman`). 

.. code:: python

    # Define table headers
    headers= ['Dataset', '% Men', '% Women']

    # Calculate the relative proportions of mean and women in each dataset
    table = [["Dataset 1",(sum(dataset_1["M/V"]==0)/len(dataset_1)*100),(sum(dataset_1["M/V"]==1)/len(dataset_1)*100)],["Dataset 2",(sum(dataset_2["M/V"]==0)/len(dataset_2)*100),(sum(dataset_2["M/V"]==1)/len(dataset_2)*100)],["Dataset 3",(sum(dataset_3["M/V"]==0)/len(dataset_3)*100),(sum(dataset_3["M/V"]==1)/len(dataset_3)*100)]]


    # Print the table with the calculated proportions using the tabulate library
>>> print(tabulate(table, headers=headers, tablefmt='psql')) 
    +-----------+---------+-----------+
    | Dataset   |   % Men |   % Women |
    |-----------+---------+-----------|
    | Dataset 1 |      53 |        47 |
    | Dataset 2 |      37 |        33 |
    | Dataset 3 |      42 |        58 |
    +-----------+---------+-----------+


This result is strange, we are computing the percentage of men and women, yet the percentages in Dataset 2 do not add up to one hundred. There is something unusual in the database for that source. As we do not have direct access to the data, we can try to see if we have similar data. using the same structure of queries, let's see if the database has some values that are close, like ``-1`` or ``2``.

.. code:: python

    # Define table headers
    headers= ['0', '1', '-1','2']

    # Count the number of entries of each type
    table = [[(sum(dataset_2["M/V"]==0)),(sum(dataset_2["M/V"]==1)),(sum(dataset_2["M/V"]==-1)),(sum(dataset_2["M/V"]==2))]]


    # Print the table with the calculated proportions using the tabulate library
>>> print(tabulate(table, headers=headers, tablefmt='psql')) 
    +-----+-----+------+-----+
    |   0 |   1 |   -1 |   2 |
    |-----+-----+------+-----|
    |  37 |  33 |    0 |  30 |
    +-----+-----+------+-----+

We got lucky here, we found that the remaining rows have the value ``2``. Of course, in order to know what this means we must contact the data owner to explain the meaning of the data. In this case, we know that Hospital 2 marked whether the information was filled out by the caregiver in the ``M/V``. [1]_ This means that we have lost data in that case. We can still filter using only the values for zero and one.


Now let's create a pie chart to visualize the relative distribution of care profiles for men in dataset 1. 

.. code:: python

    print("\n\n\nQuestion 4: What is the relative distribution of care profiles within a certain gender \n")
        
    print("Dataset 1 - Men")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice in the pie chart
    sizes = [sum((dataset_1["M/V"]==0) & (dataset_1["zorgprofiel"]==2)),
            sum((dataset_1["M/V"]==0) & (dataset_1["zorgprofiel"]==3)),
            sum((dataset_1["M/V"]==0) & (dataset_1["zorgprofiel"]==4))]

    # Define the amount of "explode" for each slice (i.e. how far apart it should be from the center)
    explode = (0, 0, 0)

    # Create a new figure and axes for the pie chart
    fig1, ax1 = plt.subplots()

    # Create the pie chart using the defined parameters
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Set the aspect ratio of the axes to be equal so that the pie chart is circular
    ax1.axis('equal')

    # Add a title to the pie chart
    plt.title("Distribution of care profiles for Men in Dataset 1") 

    # Display the pie chart
    plt.show()

.. parsed-literal::

    
    Question 4: What is the relative distribution of care profiles within a certain gender
    
    Dataset 1 - Men
    
.. image:: diabetes_statistics_files/diabetes_statistics_12_1.png

Next, we can take a look using a pie chart for women in dataset 1:

.. code:: python

    print("\nDataset 1 - Women")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice in the pie chart
    sizes = [sum((dataset_1["M/V"]==1) & (dataset_1["zorgprofiel"]==2)),
            sum((dataset_1["M/V"]==1) & (dataset_1["zorgprofiel"]==3)),
            sum((dataset_1["M/V"]==1) & (dataset_1["zorgprofiel"]==4))]

    # Define the amount of "explode" for each slice (i.e. how far apart it should be from the center)
    explode = (0, 0, 0)

    # Create a new figure and axes for the pie chart
    fig1, ax1 = plt.subplots()

    # Create the pie chart using the defined parameters
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Set the aspect ratio of the axes to be equal so that the pie chart is circular
    ax1.axis('equal')

    # Add a title to the pie chart
    plt.title("Distribution of care profiles for Women in Dataset 1") 

    # Display the pie chart
    plt.show()


.. parsed-literal::

    
    Dataset 1 - Women
    

.. image:: diabetes_statistics_files/diabetes_statistics_13_1.png

Next, for men in dataset 2:

.. code:: python

    
    print("\nDataset 2 - Men")
    
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_2["M/V"]==0) & (dataset_2["zorgprofiel"]==2)),
    sum((dataset_2["M/V"]==0) & (dataset_2["zorgprofiel"]==3)),
    sum((dataset_2["M/V"]==0) & (dataset_2["zorgprofiel"]==4))]
    
    explode = (0, 0, 0)  # only "explode" the 2nd slice 
    
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    
    plt.title("Distribution of care profiles for Men in Dataset 2") 
    
    plt.show()
    

.. parsed-literal::

    
    Dataset 2 - Men
    

.. image:: diabetes_statistics_files/diabetes_statistics_14_1.png

Next, the percentage distribution of different care profiles for women in dataset 2:

.. code:: python

    print("\nDataset 2 - Women")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice in the pie chart
    sizes = [sum((dataset_2["M/V"]==1) & (dataset_2["zorgprofiel"]==2)),
            sum((dataset_2["M/V"]==1) & (dataset_2["zorgprofiel"]==3)),
            sum((dataset_2["M/V"]==1) & (dataset_2["zorgprofiel"]==4))]

    # Define the amount of "explode" for each slice (i.e. how far apart it should be from the center)
    explode = (0, 0, 0)

    # Create a new figure and axes for the pie chart
    fig1, ax1 = plt.subplots()

    # Create the pie chart using the defined parameters
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Set the aspect ratio of the axes to be equal so that the pie chart is circular
    ax1.axis('equal')

    # Add a title to the pie chart
    plt.title("Distribution of care profiles for Women in Dataset 2") 

    # Display the pie chart
    plt.show()

    
.. parsed-literal::

    
    Dataset 2 - Women
    

.. image:: diabetes_statistics_files/diabetes_statistics_15_1.png

The following is an interesting example of properly cleaning and defining the data before uploading it to the VDL 
Here we will visualize the distribution of care profiles for parents/guardians in dataset 2: 

.. code:: python

    # Print title
    print("\nDataset 2 - Parents/Carers")

    # Define labels and sizes for pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_2["M/V"]==2) & (dataset_2["zorgprofiel"]==2)),
            sum((dataset_2["M/V"]==2) & (dataset_2["zorgprofiel"]==3)),
            sum((dataset_2["M/V"]==2) & (dataset_2["zorgprofiel"]==4))]

    # Set explode parameter for pie chart
    explode = (0, 0, 0)  # only "explode" the 2nd slice 

    # Create pie chart with labels, sizes, explode, and formatting parameters
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

    # Set title for pie chart
    plt.title("Distribution of care profiles for Parents/Carers in Dataset 2") 

    # Display pie chart
    plt.show()
    

.. parsed-literal::
    
    Dataset 2 - Parents/Carers

.. image:: diabetes_statistics_files/diabetes_statistics_16_1.png

Next, for the distribution of care profiles within men in dataset 3:

.. code:: python

    
    print("\nDataset 3 - Men")
    
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_3["M/V"]==0) & (dataset_3["zorgprofiel"]==2)),
    sum((dataset_3["M/V"]==0) & (dataset_3["zorgprofiel"]==3)),
    sum((dataset_3["M/V"]==0) & (dataset_3["zorgprofiel"]==4))]
    
    explode = (0, 0, 0)  # only "explode" the 2nd slice 
    
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    
    plt.title("Distribution of care profiles for Men in Dataset 3") 
    
    plt.show()
    

.. parsed-literal::

    
    Dataset 3 - Men
    

.. image:: diabetes_statistics_files/diabetes_statistics_17_1.png

and finally, the distribution of care profiles within women in dataset 3: 

.. code:: python

    
    print("\nDataset 3 - Women")
    
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_3["M/V"]==1) & (dataset_3["zorgprofiel"]==2)),
    sum((dataset_3["M/V"]==1) & (dataset_3["zorgprofiel"]==3)),
    sum((dataset_3["M/V"]==1) & (dataset_3["zorgprofiel"]==4))]
    
    explode = (0, 0, 0)  # only "explode" the 2nd slice 
    
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                                shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    
    plt.title("Distribution of care profiles for Women in Dataset 3") 
    
    plt.show()


.. parsed-literal::

    
    Dataset 3 - Women
    

.. image:: diabetes_statistics_files/diabetes_statistics_18_1.png

To end everything, let's calculate and display the average age and standard deviation for each of the 3 datasets - and display it in a table:

.. code:: python


    # Print question header
    print("\n\n\nQuestion 5: What is the mean 'Age' and standard deviation of the sample? \n")

    # Define column headers for table
    headers = ["Dataset", "Mean age", "σ age"]

    # Create table as a list of lists with dataset names, mean ages, and standard deviations
    table = [["Dataset 1", dataset_1['Leeftijd'].mean(), math.sqrt(dataset_1['Leeftijd'].var())],
            ["Dataset 2", dataset_2['Leeftijd'].mean(), math.sqrt(dataset_2['Leeftijd'].var())],
            ["Dataset 3", dataset_3['Leeftijd'].mean(), math.sqrt(dataset_3['Leeftijd'].var())]]

    # Format and display table using tabulate library
    print(tabulate(table, headers=headers, tablefmt='psql'))


.. parsed-literal::

    
    Question 5: What is the mean 'Age' and standard deviation of the sample?
    
    +-----------+-----------------------+--------------+
    | Dataset   |              Mean age |        σ age |
    |-----------+-----------------------+--------------|
    | Dataset 1 |                 50.53 |      26.8883 |
    | Dataset 2 |                 56.69 |      27.1337 |
    | Dataset 3 |                 53.73 |      27.3912 |
    +-----------+-----------------------+--------------+

.. [1] Why? We don't know, unfortunately those things happen.