Combining data
###############

crandas not only allows us to perform computations over secret data, it also allows us to combine data from multiple sources. If several sources hold data of the same kind, such as medical data from different hospitals, we can concatenate the tables to pool their datapoints. We can also join tables that share a common attribute, complementing the information from different scopes.

Concatenating tables
============================

To show how we can enable data collaboration, we will assume that we performed a lifestyle survey in half of the country. A different company performed the same survey for the other half. Both parties want to gain insights about the whole country, without letting the other company have access to their datapoints. To do so, we will upload both tables to the Virtual Data Lake and concatenate them with crandas.

As the data owners of the survey, we can see our own data in the clear and operate on it, which is why we are able to use the :code:`head()` function below to examine the first few rows of our dataset.

We read in the data from the lifestyle survey using the ``read_csv()`` function from pandas. Then we upload this pandas dataframe to the Virtual Data Lake.

.. code:: python

    survey_clear = pd.read_csv("/gs_data/health_data_survey.csv")

    our_survey = cd.upload_pandas_dataframe(survey_clear)

We can see the first rows of our survey using the ``head()`` function of ``survey_clear``. We could not do the same for ``our_survey``, as that one has been uploaded to the Virtual Data Lake.

>>> survey_clear.head()
   patient_id     Sex  Age   Country  SleepTime Smoking  AlcoholDrinking
0    99996928  Female   54     Spain          7      No                0
1    99989371    Male   44     Spain          8     Yes                0
2    99989186  Female   44     Italy          6     Yes                0
3    99987062    Male   22  Portugal          5      No                0
4    99986378    Male   58     Spain          5      No                0

Next, we need access to the database of the other survey.
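Since crandas mirrors the pandas interface, this first step can be tried locally with a small stand-in DataFrame (the rows below are invented for illustration; with a real VDL connection the upload call would be ``cd.upload_pandas_dataframe``):

```python
import pandas as pd

# A toy frame with the same schema as the survey (values are made up)
survey_clear = pd.DataFrame({
    "patient_id": [99996928, 99989371, 99989186],
    "Sex": ["Female", "Male", "Female"],
    "Age": [54, 44, 44],
    "Country": ["Spain", "Spain", "Italy"],
    "SleepTime": [7, 8, 6],
    "Smoking": ["No", "Yes", "Yes"],
    "AlcoholDrinking": [0, 0, 0],
})

# head() returns the first rows (up to 5 by default); in the walkthrough
# this is what we inspect in the clear before uploading to the VDL
print(survey_clear.head())
```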
Because it belongs to another party, we can only gain access to it after the data owner sends us the handle of the uploaded table. Then we can access the table through :func:`.get_table`. As this is only an example, we will upload this table ourselves as well (normally you would use handles to access data that you do not own). It is also possible to specify a name for the table in :func:`.upload_pandas_dataframe` (using :code:`name="chosen_name"` as a parameter) in order to use more meaningful and easier to use names.

.. code:: python

    their_survey = cd.read_csv("gs_data/health_their_survey.csv")

Similar to the first section, where we looked at reading and writing data, we can implicitly use the ``repr()`` function by simply writing the name of the :class:`.CDataFrame` to obtain the handle for the table, along with the ``Size`` and column details, like name and data type (``i`` for integer and ``s`` for string).

>>> their_survey
Name: 047E616AD321B7E30C4A2322D6AA0C38CCA264D10F7CF1BB01576D970806B47F
Size: 8943 rows x 7 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1), Col("Country", "s", 9), Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1)])

We can compare it to the data from our survey to see if the columns match:

>>> our_survey
Name: EFC8A915781AE712856CA7F88AAED560A92D154BA6F78B90DCE36B287B9BC2AA
Size: 15142 rows x 7 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1), Col("Country", "s", 9), Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1)])

Thankfully the columns match, so we know we can go forward and concatenate the two tables.

.. code:: python

    survey = cd.concat([our_survey, their_survey])

We can check the structure of the ``survey`` table we created and see that it holds the records of both tables:

>>> print(repr(survey))
Name: 180A662EF5BB52396D31516E9127FDE298A1B6A76673A00BD215ACBA9853FD8E
Size: 24085 rows x 7 columns
CIndex([Col("patient_id", "i", 1), Col("Sex", "s", 7), Col("Age", "i", 1), Col("Country", "s", 9), Col("SleepTime", "i", 1), Col("Smoking", "s", 4), Col("AlcoholDrinking", "i", 1)])

We have combined two data sources that represent the same kind of information, but with different datapoints. Next, we will see how to combine data sources that share a common key.

Joining tables
============================

To show the other way crandas enables data collaboration, we will combine the survey data from the previous section with sensitive medical information from a different party. We are given the name of the table ("medical_data") by the other party, and we load it into our environment in the following way:

.. code:: python

    medical = cd.get_table("medical_data")

We know that ``patientnr`` in the ``medical`` table corresponds to ``patient_id`` in the survey table, so we will join the tables on these columns. This is done using the :func:`.merge` command, where we first list the two dataframes we wish to join and then the type of join we want to execute (an inner join, as we only want to return the rows that have a match in both datasets). Lastly, the dataset listed first is referred to as ``left`` and the second as ``right``, and we provide the column names in each dataset to join on.

.. code:: python

    merged = cd.merge(survey, medical, how="inner", left_on="patient_id", right_on="patientnr")

As we did an inner join, we can check the size of the intersection. The function ``len()`` in this case returns the number of rows in the merged dataset – 11,038 rows.

>>> print(len(merged))
11038

We can now analyze key aspects of the aggregate data to derive useful insights.
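Since crandas follows pandas semantics for both operations, the concatenate-then-join flow can be sketched locally with plain pandas on toy data (all names and values below are made up for illustration):

```python
import pandas as pd

# Two survey halves with identical columns
ours = pd.DataFrame({"patient_id": [1, 2], "Age": [54, 44]})
theirs = pd.DataFrame({"patient_id": [3], "Age": [61]})

# Concatenation appends rows: 2 + 1 = 3 rows
survey = pd.concat([ours, theirs], ignore_index=True)

# A second table keyed on a differently named patient column
medical = pd.DataFrame({"patientnr": [2, 3, 4], "Stroke": [1, 0, 1]})

# An inner join keeps only patients present in both tables,
# so len() gives the size of the intersection
merged = survey.merge(medical, how="inner",
                      left_on="patient_id", right_on="patientnr")
print(len(merged))  # 2: only patients 2 and 3 appear in both
```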
Assume we want to know the average age of patients that have had a stroke. We can easily filter for this data in the following way:

.. code:: python

    strokes_filter = merged["Stroke"] == 1
    total_age = sum(merged[strokes_filter]["Age"])
    total_strokes = sum((merged["Stroke"]==1).if_else(1, 0))

>>> print("The average age of someone who had a stroke is", total_age/total_strokes)
The average age of someone who had a stroke is 57.02721088435374

There are other ways to do this, though. It could also have been done entirely with :meth:`if_else<.CSeries.if_else>` instead of nesting filters. :code:`if_else` acts as an 'if else' conditional statement: for each row where :code:`Stroke` equals 1 (i.e. the patient had a stroke), we include the age value in the sum; otherwise we add 0, which has no effect on the total age. For the total number of strokes, we similarly sum a 1 for every row that satisfies the condition:

.. code:: python

    total_age2 = sum((merged["Stroke"]==1).if_else(merged["Age"], 0))
    total_strokes2 = sum((merged["Stroke"]==1).if_else(1, 0))

>>> print("The average age of someone who had a stroke is still", total_age2/total_strokes2)
The average age of someone who had a stroke is still 57.02721088435374

Again we come to the same conclusion: the average age at which someone has a stroke, based on this data, is just over 57 years, in line with the previous method of calculation.

A more complex way to do this, which might be useful if we want to perform subsequent analysis over the data, is to create a new column. The new column will contain the age of only those patients that had a stroke, and a 0 otherwise. This new column will behave in the same way as the rest of the table and its contents will be secret.
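Both variants above reduce to the same arithmetic, which can be sanity-checked on toy numbers with plain pandas (the values below are made up):

```python
import pandas as pd

merged = pd.DataFrame({"Age": [60, 40, 55, 30],
                       "Stroke": [1, 0, 1, 0]})

# Variant 1: filter the rows, then sum the ages
strokes_filter = merged["Stroke"] == 1
avg1 = merged[strokes_filter]["Age"].sum() / strokes_filter.sum()

# Variant 2: conditional sum, the plain-pandas analogue of if_else:
# include Age where Stroke == 1, contribute 0 otherwise
total_age = (strokes_filter * merged["Age"]).sum()
total_strokes = strokes_filter.sum()
avg2 = total_age / total_strokes

assert avg1 == avg2 == 57.5  # (60 + 55) / 2
```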
We create a new column ``stroke_age`` that consists of ``Age`` multiplied by the binary variable ``Stroke`` (0 if no and 1 if yes).

.. code:: python

    merged = merged.assign(stroke_age=merged["Age"]*merged["Stroke"])

Note that simply using the ``mean`` function on the new column will give an obviously incorrect answer – the value in the column for an individual that hasn't had a stroke is 0, so many zero-values would be included in the mean, skewing it downwards.

>>> print("The average age of someone who had a stroke is probably not", merged["stroke_age"].mean())
The average age of someone who had a stroke is probably not 2.2784018843993477

As expected, the result is inconsistent with the previous attempts. This can be fixed by computing the total age over the column and the number of people that actually had a stroke, and then dividing ``total_age`` by ``total_strokes`` to get an accurate mean. For ``total_age``, we sum the ``stroke_age`` column; for ``total_strokes``, we count the number of people who had a stroke.

.. code:: python

    total_age = merged["stroke_age"].sum()
    total_strokes = sum(merged["stroke_age"] != 0)

>>> print("The average age of someone who had a stroke is", total_age/total_strokes)
The average age of someone who had a stroke is 57.02721088435374

Note that we use two functions called ``sum()``, but each does a different thing. The former (:meth:`merged["stroke_age"].sum()<.CSeriesColRef.sum()>`), a method of the column ``merged["stroke_age"]``, sums the values of the column. The latter (``sum(merged["stroke_age"] != 0)``) counts the number of table rows that fulfill the condition. Alternatively, we could also have used ``merged["Stroke"].sum()`` to arrive at the same number for ``total_strokes``.

.. warning::
    It is very important to maintain good data hygiene, especially when dealing with data in the VDL, as there is no easy way to check its correctness. Data should be formatted consistently in order for analyses to work as we want them to. For example, there are multiple ways to represent dates, and one must make sure that the same format is used when concatenating or joining tables. It might be easy to spot that ``2020-03-01``, ``01-03-2020`` and ``2020/03/01`` all refer to the same date, but there is no way to see this similarity once the values are in the Virtual Data Lake.
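One way to guard against such mismatches is to normalize dates to a single format before uploading. A minimal sketch using only the Python standard library (``normalize_date`` is a hypothetical helper, not part of crandas, and it assumes day-month-year order for dashed dates):

```python
from datetime import datetime

def normalize_date(value: str) -> str:
    """Return the date in ISO 8601 (YYYY-MM-DD), trying known input formats."""
    # Assumed input formats; "01-03-2020" is taken as day-month-year
    for fmt in ("%Y-%m-%d", "%d-%m-%Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

# All three spellings from the example normalize to the same value
for d in ("2020-03-01", "01-03-2020", "2020/03/01"):
    print(normalize_date(d))  # prints 2020-03-01 each time
```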