.. _merge: Merge, join, and concatenate ############################ Crandas is a tool that allows for secure data- collaboration. It allows combining data from multiple sources to do computations. This section will show you exactly that. Here you will learn how to can combine :class:`CDataFrames<.CDataFrame>` in different ways. There are multiple ways to combine tables in crandas, and here we will see them all: - :func:`.crandas.merge` - :func:`.crandas.concat` Concatenating tables ---------------------- The simplest way to concatenate tables is by just stacking them together, either on top of each other or side by side. In order to do this we must be sure that the tables have the same structure, otherwise we will not be able to concatenate them. In order to do this you can use :func:`.crandas.concat`. When you have multiple tables with the same structure that you want to combine (like when you are working with multiple data sources that do the same thing, like for example hospitals), that is when you use vertical concatenation (``axis=0``). .. warning:: The tables you want to join **must** have the same number of columns, the same column names and the same types. It is also important to know that data is formatted in the same way before doing this. .. code:: python import crandas as cd # Creating sample CDataFrames a = cd.DataFrame({"id": [2, 1, 3, 4], "id2": [5, 1, 4, 0]}) b = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]}) c = cd.DataFrame({"id3": [3, 0, 1, 2], "id": [54, 1, 42, 1]}) d = cd.DataFrame({"id2": [3, 0, 1, 2], "id": [54, 1, 42, 1]}) e = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]}) # Simple vertical concatenation t1 = cd.concat([a, b]) # Vertical concatenation, equivalent to the above t2 = cd.concat([a, b], axis=0) # Concatenate only common columns, ignore index t3 = cd.concat([a, c], join="inner") # Concatenate CDataFrames with columns in a different order t4 = cd.concat([a, d]) # Concatenate multiple CDataFrames at once t4 = cd.concat([a, b, e]) It is also possible to concatenate tables *horizontally*, that is, next to each other. You can do this by using ``axis=1``, **however** this is highly discouraged when working with private data, as it is impossible to check whether the two tables "match". Thankfully, there are smarter ways that we can do that. Merging/Joining :class:`CDataFrames<.CDataFrame>` ------------------------------------------------- In order to properly concatenate the rows of two tables, we need to have a key (or multiple) that identifies which rows should be joined together. When we have this, we can **join** the tables together using :func:`.crandas.merge`. The main way we can join two tables is as an *inner join*. This means that given a key, only rows where that key appears in *both* tables will appear in the joined table. This type of join is very useful as this way we avoid carrying missing values when one value appears in one table and not the other. .. code:: python left_df = cd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Eve"]}) right_df = cd.DataFrame({"id": [4, 2, 1], "salary": [1234, 1111, 98765], "name": ["staple", "horse", "correct"]}) # Inner join merged = cd.merge(left_df, right_df, on="id") >>> print(merged.open()) id name_x salary name_y 0 1 Alice 98765 correct 1 2 Bob 1111 horse The new table combined the two tables on ``id`` regardless of the original order and did not include the values that were not there. Additionally, it added suffixes ``_x`` and ``_y`` to the columns that had a repeated name. Note that the name of the key column names do not need to match, in that case you can use ``left_on`` and ``right_on`` instead of ``on``. Additionally, you can perform an outer join or join using multiple keys. .. code:: python left_df = cd.DataFrame( { "id": [1, 2, 3], "name": ["Alice", "Bob", "Eve"], "good/evil": [1, 1, 0], "color": ["green", "black", "green"], } ) right_df = cd.DataFrame( { "key": [4, 2, 1], "salary": [1234, 1111, 98765], "name": ["staple", "horse", "correct"], "good/evil": [1, 1, 0], "color": ["green", "black", "green"], } ) # Outer join, joining on different named keys merged_outer = cd.merge(left_df, right_df, how="outer", left_on="id", right_on="key") # Inner join over multiple keys merged_mult_keys = cd.merge(left_df, right_df, on=["good/evil", "color"]) .. note:: Currently, joins only work if the keys are unique in *both* tables (except for one exception that we present below). Sometimes, you may wish to join two datasets that have different column names that you want to join on. In this case, you can specify ``left_on`` and ``right_on`` (left referring to the first dataset mentioned and right referring to the second). .. _left join: "Left" join =================== We said before that merging tables requires all keys to be unique. The one exception to this is the following situation: One database consists of purchases by a group of people. The second database contains information about people, where each row corresponds to a person. As a person might make more than one purchase (or none), there will likely be not exactly one row per person in the purchases database. We want to join both tables, which in this case effectively means adding the information of each person to each purchase that that person did. In other words, we want to fill the *left* table with information from the *right* one. This is why we call this a *left join*. [1]_ We are able to do this, but because there are some additional cryptographic implications, it is not as straightforward as before. First we have to create a :class:`crandas.groupby` object in the following way: .. code:: python purchases = cd.DataFrame({"price": [11,22,33,44,55,66,77,88,99], "buyer_id": [1,2,3,3,5,5,5,5,6]}) grouped = purchases.groupby("buyer_id") After doing this we can join it with our people database .. code:: python people = cd.DataFrame({"names": ["Alice", "Bob", "Crow", "Dave", "Eve"], "id": [1,2,3,4,5], "age": [30,32,29,45,55]}) # Note how `left_on` has the grouped object and not the name of the column merged = cd.merge(purchases, people, left_on=grouped, right_on="id", how="left") >>> print(merged.open()) price buyer_id names age 0 11 1 Alice 30 1 22 2 Bob 32 2 33 3 Crow 29 3 44 3 Crow 29 4 55 5 Eve 55 5 66 5 Eve 55 6 77 5 Eve 55 7 88 5 Eve 55 8 99 6 0 This is one of the clearest examples of how crandas is different to pandas. Unfortunately in order to maintain the cryptographic guarantees for the data, we have to diverge how things are done in pandas. However, the fact that we can create such a table without leaking information is worth it! .. warning:: The left join does not fill the missing entries with ``null`` values. Instead, it adds a particular value depending of the datatype. See the above example: missing entries in the varchar column ``names`` in the people database are represented by the empty string ``""``. On the other hand, missing values for the integer column ``age`` are filled by zeros. .. [1] This is a bit of a misnomer, formally speaking this is a *left many-to-one join*.