Data exploration
===========================

In this tutorial, we analyze diabetes-related datasets using the crandas library. We import the necessary libraries, load the datasets, and then perform some data analysis tasks.

First, let's import the required libraries:

.. code-block:: python

    import crandas as cd
    import pandas as pd
    import math
    from tabulate import tabulate
    import matplotlib.pyplot as plt

    # Set the colours for our MatPlotLib charts
    plt.rcParams['axes.prop_cycle'] = plt.cycler('color', ['#ec297b', '#4f89d6', '#03186b', '#343434', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf'])

    # On a jupyter environment provided by RosemanLabs, session variables are set automatically in the background
    # Set the session base_path and session.endpoint manually when executing this notebook in any other environment
    # Uncomment the following 4 commented lines and change their value to set these session variables
    # from crandas.base import session
    # from pathlib import Path
    # session.base_path = Path('base/path/to/vdl/secrets')
    # session.endpoint = 'https://localhost:9820/api/v1'

Step 1: Accessing the Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, let's load the datasets. There are three CSV files that we will be working with. We assume that each of them contains medical data about patients with diabetes, provided by different hospitals.

.. note::

    In a real-world scenario, each hospital would upload its own data and here we would simply retrieve it from the VDL. In this example, we simply upload all three databases ourselves.

.. code-block:: python

    file1 = "data/diabetes_data/diabetes_dummy_1.csv"
    file2 = "data/diabetes_data/diabetes_dummy_2.csv"
    file3 = "data/diabetes_data/diabetes_dummy_3.csv"

    dataset_1 = cd.read_csv(file1, name="dataset_1")
    dataset_2 = cd.read_csv(file2, name="dataset_2")
    dataset_3 = cd.read_csv(file3, name="dataset_3")

Let's check the structure of the first dataset:

>>> print(repr(dataset_1))
Name: 753CEC55C6CDD2E2B6D98455E0C35F6D57837B103252E864853403FAF3A47728
Size: 100 rows x 18 columns
CIndex([Col("Patientnr", "i", 1), Col("M/V", "i", 1), Col("Leeftijd", "i", 1), Col("patient_sims", "i", 1), Col("zorgprofiel", "i", 1), Col("Hoofdbehandelaar", "i", 1), Col("Podotherapeut huisbezoek", "i", 1), Col("Pedicure huisbezoek", "i", 1), Col("DM Type", "i", 1), Col("Tekenen van infectie", "i", 1), Col("Ulcus/amputatie", "i", 1), Col("Inactieve charcot-voet", "i", 1), Col("Nierfalen/Dialyse ", "i", 1), Col("Pulsaties links ", "i", 1), Col("Pulsaties rechts", "i", 1), Col("Doppler links", "i", 1), Col("Doppler rechts", "i", 1), Col("Huidig schoeisel", "i", 1)])

Step 2: Exploring the Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, we want to find what share of each dataset concerns Type 1 diabetes and what share concerns Type 2. For this, we filter by type using ``dataset["DM Type"]==1`` and count the matching entries using ``sum``. Then we divide that count by the size of each dataset.

.. code-block:: python

    # Print the question
    print("Question 1: What is the relative ratio between both types of diabetes in each dataset? \n")

    # Define the column headers for the table
    headers=["Dataset", "% DM Type 1", "% DM Type 2"]

    # Define the data to be displayed in the table
    table = [["Dataset 1",((sum(dataset_1["DM Type"]==1)/len(dataset_1))*100), ((sum(dataset_1["DM Type"]==2)/len(dataset_1))*100)],
             ["Dataset 2",((sum(dataset_2["DM Type"]==1)/len(dataset_2))*100), ((sum(dataset_2["DM Type"]==2)/len(dataset_2))*100)],
             ["Dataset 3",((sum(dataset_3["DM Type"]==1)/len(dataset_3))*100), ((sum(dataset_3["DM Type"]==2)/len(dataset_3))*100)]]

    # Format the table using the tabulate library and print it
    print(tabulate(table, headers=headers, tablefmt='psql'))

.. parsed-literal::

    Question 1: What is the relative ratio between both types of diabetes in each dataset?

    +-----------+---------------+---------------+
    | Dataset   |   % DM Type 1 |   % DM Type 2 |
    |-----------+---------------+---------------|
    | Dataset 1 |            50 |            50 |
    | Dataset 2 |            54 |            46 |
    | Dataset 3 |            47 |            53 |
    +-----------+---------------+---------------+

Different cases of the same type require different care profiles. Next we can create a pie chart to visualize the relative distribution of such care profiles within Type 1 diabetes for dataset 1. We feed our open data to a plotting package to display it.

.. code:: python

    # Print the question to be answered
    print("\nQuestion 2: What is the relative distribution of care profiles within a certain type of diabetes? \n")

    # Specify the dataset being analyzed
    print("Dataset 1 - DM Type 1")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_1["DM Type"]==1) & (dataset_1["zorgprofiel"]==2)),
             sum((dataset_1["DM Type"]==1) & (dataset_1["zorgprofiel"]==3)),
             sum((dataset_1["DM Type"]==1) & (dataset_1["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart; here none is separated
    explode = (0, 0, 0)

    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 1 in Dataset 1")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Question 2: What is the relative distribution of care profiles within a certain type of diabetes?

    Dataset 1 - DM Type 1

.. image:: diabetes_statistics_files/diabetes_statistics_5_1.png

Now we do the same for Type 2 in dataset 1:

.. code:: python

    # Specify the dataset being analyzed
    print("\nDataset 1 - DM Type 2")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_1["DM Type"]==2) & (dataset_1["zorgprofiel"]==2)),
             sum((dataset_1["DM Type"]==2) & (dataset_1["zorgprofiel"]==3)),
             sum((dataset_1["DM Type"]==2) & (dataset_1["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)

    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 2 in Dataset 1")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 1 - DM Type 2

.. image:: diabetes_statistics_files/diabetes_statistics_6_1.png

After this we move on to dataset 2 and do the same as above. First Type 1:

.. code:: python

    # Specify the dataset being analyzed
    print("\nDataset 2 - DM Type 1")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_2["DM Type"]==1) & (dataset_2["zorgprofiel"]==2)),
             sum((dataset_2["DM Type"]==1) & (dataset_2["zorgprofiel"]==3)),
             sum((dataset_2["DM Type"]==1) & (dataset_2["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)

    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 1 in Dataset 2")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 2 - DM Type 1

.. image:: diabetes_statistics_files/diabetes_statistics_7_1.png

Now Type 2 for dataset 2:

.. code:: python

    # Specify the dataset being analyzed
    print("\nDataset 2 - DM Type 2")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_2["DM Type"]==2) & (dataset_2["zorgprofiel"]==2)),
             sum((dataset_2["DM Type"]==2) & (dataset_2["zorgprofiel"]==3)),
             sum((dataset_2["DM Type"]==2) & (dataset_2["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)

    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 2 in Dataset 2")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 2 - DM Type 2

.. image:: diabetes_statistics_files/diabetes_statistics_8_1.png

Next, DM Type 1 for dataset 3:

.. code:: python

    # Specify the dataset being analyzed
    print("\nDataset 3 - DM Type 1")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_3["DM Type"]==1) & (dataset_3["zorgprofiel"]==2)),
             sum((dataset_3["DM Type"]==1) & (dataset_3["zorgprofiel"]==3)),
             sum((dataset_3["DM Type"]==1) & (dataset_3["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)

    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 1 in Dataset 3")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 3 - DM Type 1

.. image:: diabetes_statistics_files/diabetes_statistics_9_1.png

Finally, for DM Type 2 in dataset 3:

.. code:: python

    # Specify the dataset being analyzed
    print("\nDataset 3 - DM Type 2")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice of the pie chart
    sizes = [sum((dataset_3["DM Type"]==2) & (dataset_3["zorgprofiel"]==2)),
             sum((dataset_3["DM Type"]==2) & (dataset_3["zorgprofiel"]==3)),
             sum((dataset_3["DM Type"]==2) & (dataset_3["zorgprofiel"]==4))]

    # Specify which slice (if any) to separate from the rest of the chart
    explode = (0, 0, 0)

    # Create a figure with a single subplot and plot the pie chart
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Ensure that the pie chart is drawn as a circle
    ax1.axis('equal')

    # Set the title of the pie chart
    plt.title("Distribution of care profiles for DM Type 2 in Dataset 3")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 3 - DM Type 2

.. image:: diabetes_statistics_files/diabetes_statistics_10_1.png

Step 3: Looking Deeper Into the Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the next part of this tutorial, we analyze the datasets further and see how knowing the shape of our data is fundamental to our analysis, as is correctly cleaning and labelling it.

Let's say we want to perform an analysis similar to the one before, but focusing on the gender of the participants instead of diabetes type. First, we explore the ``M/V`` column (for the Dutch `Man/Vrouw`, meaning `Man/Woman`).

.. code:: python

    # Define table headers
    headers= ['Dataset', '% Men', '% Women']

    # Calculate the relative proportions of men and women in each dataset
    table = [["Dataset 1",(sum(dataset_1["M/V"]==0)/len(dataset_1)*100),(sum(dataset_1["M/V"]==1)/len(dataset_1)*100)],
             ["Dataset 2",(sum(dataset_2["M/V"]==0)/len(dataset_2)*100),(sum(dataset_2["M/V"]==1)/len(dataset_2)*100)],
             ["Dataset 3",(sum(dataset_3["M/V"]==0)/len(dataset_3)*100),(sum(dataset_3["M/V"]==1)/len(dataset_3)*100)]]

    # Print the table with the calculated proportions using the tabulate library
    print(tabulate(table, headers=headers, tablefmt='psql'))

.. parsed-literal::

    +-----------+---------+-----------+
    | Dataset   |   % Men |   % Women |
    |-----------+---------+-----------|
    | Dataset 1 |      53 |        47 |
    | Dataset 2 |      37 |        33 |
    | Dataset 3 |      42 |        58 |
    +-----------+---------+-----------+

This result is strange: we are computing the percentage of men and women, yet the percentages in Dataset 2 do not add up to one hundred. There is something unusual in the database for that source. As we do not have direct access to the data, we cannot simply inspect the column, but we can probe it for other values. Using the same structure of queries, let's see if the column contains some values that are close, like ``-1`` or ``2``.

.. code:: python

    # Define table headers
    headers= ['0', '1', '-1','2']

    # Count the number of entries of each type
    table = [[(sum(dataset_2["M/V"]==0)),(sum(dataset_2["M/V"]==1)),(sum(dataset_2["M/V"]==-1)),(sum(dataset_2["M/V"]==2))]]

    # Print the table with the counts using the tabulate library
    print(tabulate(table, headers=headers, tablefmt='psql'))

.. parsed-literal::

    +-----+-----+------+-----+
    |   0 |   1 |   -1 |   2 |
    |-----+-----+------+-----|
    |  37 |  33 |    0 |  30 |
    +-----+-----+------+-----+

We got lucky here: we found that the remaining rows have the value ``2``. Of course, in order to know what this means, we must contact the data owner to explain the meaning of the data.
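
The check we just did by hand can be packaged as a small helper. The following is a minimal sketch in plain Python and is not part of crandas: the function names ``unexplained_rows`` and ``percentages`` are hypothetical, the counts are the ones shown in the tables above, and the total of 100 rows for Dataset 2 is inferred from the fact that a count of 37 corresponds to 37%. In practice, the counts would come from queries such as ``sum(dataset_2["M/V"]==0)``.

```python
def unexplained_rows(counts, total_rows):
    """Return how many rows are not covered by any of the probed values.

    counts maps each probed value to the number of rows holding it.
    """
    missing = total_rows - sum(counts.values())
    if missing < 0:
        raise ValueError("counts exceed total_rows; were rows counted twice?")
    return missing


def percentages(counts, total_rows):
    """Return the share of rows (in percent) for each probed value."""
    return {value: 100 * n / total_rows for value, n in counts.items()}


# Counts copied from the Dataset 2 tables; total_rows=100 is an inferred assumption
total_rows = 100
first_probe = {0: 37, 1: 33}
second_probe = {0: 37, 1: 33, -1: 0, 2: 30}

print(percentages(first_probe, total_rows))
print("Rows not explained by 0/1:", unexplained_rows(first_probe, total_rows))
print("Rows not explained after probing -1 and 2:", unexplained_rows(second_probe, total_rows))
```

A non-zero result after the first probe signals that the column holds values outside the expected categories. A result of zero after the second probe confirms that the probed values account for every row.
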
In this case, we know that Hospital 2 used the ``M/V`` column to mark whether the information was filled out by the caregiver. [1]_ This means that the gender information is lost for those rows. We can still filter using only the values ``0`` and ``1``.

Now let's create a pie chart to visualize the relative distribution of care profiles for men in dataset 1.

.. code:: python

    print("\n\n\nQuestion 4: What is the relative distribution of care profiles within a certain gender \n")
    print("Dataset 1 - Men")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice in the pie chart
    sizes = [sum((dataset_1["M/V"]==0) & (dataset_1["zorgprofiel"]==2)),
             sum((dataset_1["M/V"]==0) & (dataset_1["zorgprofiel"]==3)),
             sum((dataset_1["M/V"]==0) & (dataset_1["zorgprofiel"]==4))]

    # Define the amount of "explode" for each slice (i.e. how far apart it should be from the center)
    explode = (0, 0, 0)

    # Create a new figure and axes for the pie chart
    fig1, ax1 = plt.subplots()

    # Create the pie chart using the defined parameters
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Set the aspect ratio of the axes to be equal so that the pie chart is circular
    ax1.axis('equal')

    # Add a title to the pie chart
    plt.title("Distribution of care profiles for Men in Dataset 1")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Question 4: What is the relative distribution of care profiles within a certain gender

    Dataset 1 - Men

.. image:: diabetes_statistics_files/diabetes_statistics_12_1.png

Next, we can take a look using a pie chart for women in dataset 1:

.. code:: python

    print("\nDataset 1 - Women")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice in the pie chart
    sizes = [sum((dataset_1["M/V"]==1) & (dataset_1["zorgprofiel"]==2)),
             sum((dataset_1["M/V"]==1) & (dataset_1["zorgprofiel"]==3)),
             sum((dataset_1["M/V"]==1) & (dataset_1["zorgprofiel"]==4))]

    # Define the amount of "explode" for each slice (i.e. how far apart it should be from the center)
    explode = (0, 0, 0)

    # Create a new figure and axes for the pie chart
    fig1, ax1 = plt.subplots()

    # Create the pie chart using the defined parameters
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Set the aspect ratio of the axes to be equal so that the pie chart is circular
    ax1.axis('equal')

    # Add a title to the pie chart
    plt.title("Distribution of care profiles for Women in Dataset 1")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 1 - Women

.. image:: diabetes_statistics_files/diabetes_statistics_13_1.png

Next, for men in dataset 2:

.. code:: python

    print("\nDataset 2 - Men")
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_2["M/V"]==0) & (dataset_2["zorgprofiel"]==2)),
             sum((dataset_2["M/V"]==0) & (dataset_2["zorgprofiel"]==3)),
             sum((dataset_2["M/V"]==0) & (dataset_2["zorgprofiel"]==4))]
    explode = (0, 0, 0)  # all zeros: no slice is "exploded"

    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.title("Distribution of care profiles for Men in Dataset 2")
    plt.show()

.. parsed-literal::

    Dataset 2 - Men

.. image:: diabetes_statistics_files/diabetes_statistics_14_1.png

Next, the percentage distribution of different care profiles for women in dataset 2:

.. code:: python

    print("\nDataset 2 - Women")

    # Define the labels for the pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'

    # Define the sizes of each slice in the pie chart
    sizes = [sum((dataset_2["M/V"]==1) & (dataset_2["zorgprofiel"]==2)),
             sum((dataset_2["M/V"]==1) & (dataset_2["zorgprofiel"]==3)),
             sum((dataset_2["M/V"]==1) & (dataset_2["zorgprofiel"]==4))]

    # Define the amount of "explode" for each slice (i.e. how far apart it should be from the center)
    explode = (0, 0, 0)

    # Create a new figure and axes for the pie chart
    fig1, ax1 = plt.subplots()

    # Create the pie chart using the defined parameters
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)

    # Set the aspect ratio of the axes to be equal so that the pie chart is circular
    ax1.axis('equal')

    # Add a title to the pie chart
    plt.title("Distribution of care profiles for Women in Dataset 2")

    # Display the pie chart
    plt.show()

.. parsed-literal::

    Dataset 2 - Women

.. image:: diabetes_statistics_files/diabetes_statistics_15_1.png

The following is an interesting example of why data should be properly cleaned and defined before uploading it to the VDL. Here we visualize the distribution of care profiles for the rows in dataset 2 that were filled out by parents/carers (``M/V`` equal to ``2``):

.. code:: python

    # Print title
    print("\nDataset 2 - Parents/Carers")

    # Define labels and sizes for pie chart
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_2["M/V"]==2) & (dataset_2["zorgprofiel"]==2)),
             sum((dataset_2["M/V"]==2) & (dataset_2["zorgprofiel"]==3)),
             sum((dataset_2["M/V"]==2) & (dataset_2["zorgprofiel"]==4))]

    # Set explode parameter for pie chart (all zeros: no slice is "exploded")
    explode = (0, 0, 0)

    # Create pie chart with labels, sizes, explode, and formatting parameters
    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

    # Set title for pie chart
    plt.title("Distribution of care profiles for Parents/Carers in Dataset 2")

    # Display pie chart
    plt.show()

.. parsed-literal::

    Dataset 2 - Parents/Carers

.. image:: diabetes_statistics_files/diabetes_statistics_16_1.png

Next, for the distribution of care profiles within men in dataset 3:

.. code:: python

    print("\nDataset 3 - Men")
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_3["M/V"]==0) & (dataset_3["zorgprofiel"]==2)),
             sum((dataset_3["M/V"]==0) & (dataset_3["zorgprofiel"]==3)),
             sum((dataset_3["M/V"]==0) & (dataset_3["zorgprofiel"]==4))]
    explode = (0, 0, 0)  # all zeros: no slice is "exploded"

    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.title("Distribution of care profiles for Men in Dataset 3")
    plt.show()

.. parsed-literal::

    Dataset 3 - Men

.. image:: diabetes_statistics_files/diabetes_statistics_17_1.png

And finally, the distribution of care profiles within women in dataset 3:

.. code:: python

    print("\nDataset 3 - Women")
    labels = 'Care profile 2', 'Care profile 3', 'Care profile 4'
    sizes = [sum((dataset_3["M/V"]==1) & (dataset_3["zorgprofiel"]==2)),
             sum((dataset_3["M/V"]==1) & (dataset_3["zorgprofiel"]==3)),
             sum((dataset_3["M/V"]==1) & (dataset_3["zorgprofiel"]==4))]
    explode = (0, 0, 0)  # all zeros: no slice is "exploded"

    fig1, ax1 = plt.subplots()
    patches, texts, pcts = ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
    plt.setp(pcts, color='white', fontweight=600)
    ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.title("Distribution of care profiles for Women in Dataset 3")
    plt.show()

.. parsed-literal::

    Dataset 3 - Women

.. image:: diabetes_statistics_files/diabetes_statistics_18_1.png

To wrap up, let's calculate the mean age and its standard deviation for each of the three datasets and display them in a table:

.. code:: python

    # Print question header
    print("\n\n\nQuestion 5: What is the mean 'Age' and standard deviation of the sample? \n")

    # Define column headers for table
    headers = ["Dataset", "Mean age", "σ age"]

    # Create table as a list of lists with dataset names, mean ages, and standard deviations
    # (the standard deviation is computed as the square root of the variance)
    table = [["Dataset 1", dataset_1['Leeftijd'].mean(), math.sqrt(dataset_1['Leeftijd'].var())],
             ["Dataset 2", dataset_2['Leeftijd'].mean(), math.sqrt(dataset_2['Leeftijd'].var())],
             ["Dataset 3", dataset_3['Leeftijd'].mean(), math.sqrt(dataset_3['Leeftijd'].var())]]

    # Format and display table using tabulate library
    print(tabulate(table, headers=headers, tablefmt='psql'))

.. parsed-literal::

    Question 5: What is the mean 'Age' and standard deviation of the sample?

    +-----------+-----------------------+--------------+
    | Dataset   |              Mean age |        σ age |
    |-----------+-----------------------+--------------|
    | Dataset 1 |                 50.53 |      26.8883 |
    | Dataset 2 |                 56.69 |      27.1337 |
    | Dataset 3 |                 53.73 |      27.3912 |
    +-----------+-----------------------+--------------+

.. [1] Why? We don't know; unfortunately, these things happen.