Combining data

crandas not only allows us to perform computations over secret data, it also allows us to combine data from multiple sources. If several sources hold data of the same form, like medical data from different hospitals, we can concatenate them to pool datapoints from all sources. We can also take tables that share a common attribute and join them, combining complementary information from different scopes.

Concatenating tables

To show how we can enable data collaboration, we will assume that we performed a lifestyle survey in half of the country. A different company performed the same survey for the other half. We both want to gain insights about the whole country, without letting the other company have access to our datapoints. For that, we will upload both tables to the Virtual Data Lake and concatenate them with crandas.

As the data owners of the survey, we can see the data in the clear and operate on it, which is why we are able to use the head() function below to examine the first few rows of our dataset.

We read in the data from the lifestyle survey using the read_csv() function from pandas. Then we can upload this pandas dataframe to the Virtual Data Lake.

import pandas as pd
import crandas as cd
survey_clear = pd.read_csv("/gs_data/health_data_survey.csv")
our_survey = cd.upload_pandas_dataframe(survey_clear)

We can see the first rows of our survey using the head() function on survey_clear. We cannot do the same for our_survey, as that table has been uploaded to the Virtual Data Lake and its contents are secret.

>>> survey_clear.head()
    patient_id     Sex  Age   Country  SleepTime Smoking  AlcoholDrinking
0    99996928   Female   54     Spain          7      No                0
1    99989371     Male   44     Spain          8     Yes                0
2    99989186   Female   44     Italy          6     Yes                0
3    99987062     Male   22  Portugal          5      No                0
4    99986378     Male   58     Spain          5      No                0

Next, we need access to the table of the other survey. Because it belongs to another party, we can only gain access to it once the data owner sends us the handle of the uploaded table; we can then retrieve the table through get_table(). As this is only an example, we will simply upload their data ourselves (in practice, you would use handles to access data that you do not own). It is also possible to specify a name for the table in upload_pandas_dataframe() (by passing name="chosen_name" as a parameter), so that tables can be referred to by more meaningful, easier-to-use names.

their_survey = cd.read_csv("gs_data/health_their_survey.csv")
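
For illustration, the handle-based flow described in the paragraph above might look as follows; the table name "their_survey" is hypothetical and would be agreed between the parties:

# On the other party's side (they own the data):
#     their_survey = cd.upload_pandas_dataframe(their_clear, name="their_survey")
# On our side, once they share the name (or the hex handle):
#     their_survey = cd.get_table("their_survey")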

As in the first section, where we looked at reading and writing data, we can implicitly invoke repr() by simply writing the name of the CDataFrame; this shows the handle of the table along with its size and column details, such as name and data type (i for integer and s for string).

>>> their_survey
Name: 047E616AD321B7E30C4A2322D6AA0C38CCA264D10F7CF1BB01576D970806B47F
Size: 8943 rows x 7 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1), Col("Country", "s", 9), Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1)])

We can compare it to the data from our survey to see if the columns match:

>>> our_survey
Name: EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA
Size: 15142 rows x 7 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1), Col("Country", "s", 9), Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1)])

Thankfully, the columns match, so we can go ahead and concatenate the two tables.

survey = cd.concat([our_survey, their_survey])

We can check the structure of the survey table we created and see that it holds the records of both tables:

>>> print(repr(survey))
Name: 180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E
Size: 24085 rows x 7 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1), Col("Country", "s", 9), Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1)])

We have combined two data sources that represented the same thing, but with different datapoints. Next, we will see how to combine different data sources that have a common key.

Joining tables

To show the other way crandas can allow for data collaboration, we will combine the survey data from the previous section with sensitive medical information from a different party.

We are given the name of the table ("medical_data") by the other party, and we load it into our environment in the following way:

medical = cd.get_table("medical_data")

We know that patientnr in the medical table corresponds to patient_id in the survey table, so we will join the tables on these columns. This is done using the merge() command: we first list the two dataframes we wish to join, then the type of join we want to execute (an inner join, as we only want to return the records that have a match in both datasets). Lastly, referring to the dataset listed first as left and the second as right, we provide the column name in each dataset to join on.

merged = cd.merge(survey, medical, how="inner", left_on="patient_id", right_on="patientnr")

As we did an inner join, we can check the size of the intersection. The function len() in this case returns the number of rows in the merged dataset – 11,038 rows.

>>> print(len(merged))
11038
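
For context, the survey table alone has 24,085 rows (as its metadata showed above), so the inner join kept fewer than half of the survey records:

>>> print(len(survey))
24085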

We can now analyze key aspects of the combined data to derive useful insights. Suppose we want to know the average age of patients that have had a stroke.

One way is to use a filter. We first build a filter that selects the rows where Stroke equals 1 (i.e. the patient had a stroke) and sum the Age column of the filtered table. To count the stroke patients, we use if_else() on the same condition, which contributes a 1 to the sum for every row where the condition holds and a 0 otherwise:

strokes_filter = merged["Stroke"] == 1

total_age = sum(merged[strokes_filter]["Age"])
total_strokes = sum((merged["Stroke"]==1).if_else(1, 0))
>>> print("The average age of someone who had a stroke is", total_age/total_strokes)
The average age of someone who had a stroke is 57.02721088435374

There are other ways to do this, though. It could also have been done with if_else() throughout, instead of filtering: we compute the total age by summing Age where Stroke equals 1 (and 0 otherwise), and total strokes exactly as before:

total_age2 = sum((merged["Stroke"]==1).if_else(merged["Age"], 0))
total_strokes2 = sum((merged["Stroke"]==1).if_else(1, 0))
>>> print("The average age of someone who had a stroke is still", total_age2/total_strokes2)
The average age of someone who had a stroke is still 57.02721088435374

Again we reach the same conclusion: based on this data, the average age at which someone has a stroke is just over 57 years, in line with the previous method of calculation.

A more involved approach, which can be useful if we want to perform further analysis on the data, is to create a new column containing the age of each patient that had a stroke and a 0 otherwise. The new column behaves just like the rest of the table: its contents remain secret.

We create a new column stroke_age that consists of Age multiplied by the binary variable Stroke (0 if no and 1 if yes).

merged = merged.assign(stroke_age=merged["Age"]*merged["Stroke"])

Note that simply using the mean function on the new column gives an obviously incorrect answer: for any individual that hasn't had a stroke, the value in the column is 0, and all those zero values are included in the mean, skewing it heavily.

>>> print("The average age of someone who had a stroke is probably not", merged["stroke_age"].mean())
The average age of someone who had a stroke is probably not 2.2784018843993477

As we expected, the result is inconsistent with the previous attempts. This can be fixed by computing the total age over the column and the number of people that actually had a stroke, and then dividing total_age by total_strokes to obtain an accurate mean.

For total_age, we take the sum of the stroke_age column. For total_strokes, we count the number of people who had a stroke.

total_age = merged["stroke_age"].sum()
total_strokes = sum(merged["stroke_age"] != 0)
>>> print("The average age of someone who had a stroke is", total_age/total_strokes)
The average age of someone who had a stroke is 57.02721088435374

Note that we use two different functions called sum(). The first, merged["stroke_age"].sum(), is a method of the column and sums its values. The second, sum(merged["stroke_age"] != 0), counts the number of rows that fulfill the condition. Alternatively, we could have used merged["Stroke"].sum() to arrive at the same number for total_strokes.
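
That alternative relies on Stroke being encoded as 0/1, so summing the column directly counts the stroke patients:

# Stroke is a 0/1 column, so its sum equals the number of stroke patients
total_strokes_alt = merged["Stroke"].sum()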

Warning

It is very important to maintain good data hygiene, especially when dealing with data in the VDL, as there is no easy way to check its correctness after upload. Data must be formatted consistently for analyses to work as intended. For example, there are multiple ways to represent dates, and one must make sure that the same format is used when concatenating or joining tables. It is easy to spot that 2020-03-01, 01-03-2020 and 2020/03/01 all refer to the same date, but there is no way to see this similarity once the values are in the Virtual Data Lake.
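
As a minimal sketch of the kind of normalization we mean, one could canonicalize date strings in pandas before uploading; the file and column names here are hypothetical:

import pandas as pd

df = pd.read_csv("our_data.csv")  # hypothetical input file
# Parse the source's format explicitly and re-emit one canonical format,
# so that all parties' tables agree before they are uploaded.
df["visit_date"] = pd.to_datetime(df["visit_date"], format="%d-%m-%Y").dt.strftime("%Y-%m-%d")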

Working with thresholds

The VDL offers many ways to extract information from uploaded datasets, and it is the responsibility of the analyst to ensure that queries do not reveal information that is too sensitive. As we have seen, operations like sum and mean reveal data about a column. Similarly, a join of multiple tables may produce a result that contains only a few records, which may lead to unwanted disclosure of sensitive information about the entities those records pertain to. This is why many functions in crandas (such as CDataFrame.filter()) accept a threshold parameter: the function is only executed when the number of rows it applies to is higher than the specified threshold. Organizations may choose for themselves which threshold is acceptable.
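
Schematically, the pattern looks as follows; table, col, value and other_col are placeholders, and both forms reappear in the concrete example below:

# Threshold on a filter: the subset is only created if enough rows match.
subset = table[(table["col"] == value).with_threshold(5)]
# Threshold on an aggregation: the result is only revealed if the threshold is met.
result = subset["other_col"].mean(threshold=5)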

Suppose the VDL contains two tables: one with medical data about patients and another with survey data about patients. These datasets have a few records in common, which can be accessed by performing a join on patient_id.

survey_data_handle = "253A02627B55BBAC4C5B1E3370D82E3F9123E23E4021442A385AB5B83025D7F5"
medical_data_handle = "CAEFF085F0323E9137776E16898806DFE684B7E4C1BDC4963D3A4824A5A70FB9"

We can now perform a join on both datasets:

>>> joined_table = cd.merge(cd.get_table(survey_data_handle), cd.get_table(medical_data_handle), how="inner", left_on='patient_id', right_on='patientnr')
Handle: 76183E482231AC32B90F592D1069E4AF9D31AE31B130B543F4A80585736F0068
Size: 507 rows x 17 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1), Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1), Col("Country", "s", 15), Col("HeartDisease", "s", 4), Col("BMI", "i", 1), Col("Stroke", "i", 1), Col("PhysicalActivity", "i", 1), Col("GenHealth", "s", 10), Col("DiffWalking", "s", 4), Col("Diabetic", "s", 23), Col("Asthma", "i", 1), Col("KidneyDisease", "i", 1), Col("SkinCancer", "i", 1)])

Suppose we want to analyze the average BMI of individuals who have diabetes. However, there may be only a few individuals with diabetes, in which case the results of the analysis would reveal sensitive properties of those individuals. To prevent this, we only allow the table of diabetic people to be generated when enough people occur in it.

diabetic_filter = joined_table["Diabetic"] == "Yes"
diabetic_filtered = joined_table[diabetic_filter.with_threshold(5)]

No error occurred, so the resulting table apparently contains enough records to meet the threshold of 5. We are now allowed to proceed with our analysis.

>>> diabetic_filtered["BMI"].mean()
31.13953488372093

Now say that we want to know the average BMI of people who have both diabetes and kidney disease. This time we use a threshold of 2, meaning that the operation is not allowed if it applies to a single record, and we apply the threshold not to the filter but directly to the aggregation (mean()):

>>> diabetic_filtered[diabetic_filtered["KidneyDisease"]==1]["BMI"].mean(threshold=2)
ServerError:  Table does not meet rows threshold for function application (error code: ServerErrorCode.CLIENT_ERROR_THRESHOLD_NOT_MET)

This throws a threshold-not-met error: the result of this computation would indeed have revealed the BMI of the single individual with both kidney disease and diabetes. Note that the threshold parameter can also be passed to other aggregation functions, e.g.:

>>> diabetic_filtered["Age"].sum(threshold=2)
168
>>> diabetic_filtered["Age"].min(threshold=2)
34
>>> diabetic_filtered["Age"].count(threshold=2)
3