Merge, join, and concatenate¶

Crandas is a tool that allows for secure data-collaboration. It allows combining data from multiple sources to do computations. This section will show you exactly that. Here you will learn how to can combine DataFrames in different ways.

There are multiple ways to combine tables in crandas, and here we will see them all:

Concatenating tables¶

The simplest way to concatenate tables is by just stacking them together, either on top of each other or side by side. In order to do this we must be sure that the tables have the same structure, otherwise we will not be able to concatenate them. In order to do this you can use crandas.concat().

When you have multiple tables with the same structure that you want to combine (like when you are working with multiple data sources that do the same thing, like for example hospitals), that is when you use vertical concatenation (axis=0).

Warning

The tables you want to join must have the same number of columns, the same column names and the same types. It is also important to know that data is formatted in the same way before doing this.

import crandas as cd

# Creating sample DataFrames
a = cd.DataFrame({"id": [2, 1, 3, 4], "id2": [5, 1, 4, 0]})
b = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]})
c = cd.DataFrame({"id3": [3, 0, 1, 2], "id": [54, 1, 42, 1]})
d = cd.DataFrame({"id2": [3, 0, 1, 2], "id": [54, 1, 42, 1]})
e = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]})


# Simple vertical concatenation
t1 = cd.concat([a, b])

# Vertical concatenation, equivalent to the above
t2 = cd.concat([a, b], axis=0)

# Concatenate only common columns, ignore index
t3 = cd.concat([a, c], join="inner")

# Concatenate DataFrames with columns in a different order
t4 = cd.concat([a, d])

# Concatenate multiple DataFrames at once
t4 = cd.concat([a, b, e])

It is also possible to concatenate tables horizontally, that is, next to each other. You can do this by using axis=1, however this is highly discouraged when working with private data, as it is impossible to check whether the two tables “match”. Thankfully, there are smarter ways that we can do that.

Merging/Joining `DataFrames`¶

In order to properly concatenate the rows of two tables, we need to have a key (or multiple) that identifies which rows should be joined together. When we have this, we can join the tables together using crandas.merge().

The main way we can join two tables is as an inner join. This means that given a key, only rows where that key appears in both tables will appear in the joined table. This type of join is very useful as this way we avoid carrying missing values when one value appears in one table and not the other.

left_df = cd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Eve"]})
right_df = cd.DataFrame({"id": [4, 2, 1], "salary": [1234, 1111, 98765], "name": ["staple", "horse", "correct"]})

# Inner join
merged = cd.merge(left_df, right_df, on="id")

>>> print(merged.open())
       id name_x  salary  name_y
    0   1  Alice   98765  correct
    1   2    Bob    1111   horse

The new table combined the two tables on id regardless of the original order and did not include the values that were not there. Additionally, it added suffixes _x and _y to the columns that had a repeated name.

Note that the name of the key column names do not need to match, in that case you can use left_on and right_on instead of on. Additionally, you can perform a left, right, or outer join, and/or a join using multiple keys.

left_df = cd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Eve"],
        "good/evil": [1, 1, 0],
        "color": ["green", "black", "green"],
    }
)
right_df = cd.DataFrame(
    {
        "key": [4, 2, 1],
        "salary": [1234, 1111, 98765],
        "name": ["staple", "horse", "correct"],
        "good/evil": [1, 1, 0],
        "color": ["green", "black", "green"],
    }
)

# Outer join, joining on different named keys
merged_outer = cd.merge(left_df, right_df, how="outer", left_on="id", right_on="key")

# Inner join over multiple keys
merged_mult_keys = cd.merge(left_df, right_df, on=["good/evil", "color"])

Note

The above technique performs a one-to-one join, meaning that the keys need to be unique in both tables. See below for one-to-many and many-to-one joins.

Sometimes, you may wish to join two datasets that have different column names that you want to join on. In this case, you can specify left_on and right_on (left referring to the first dataset mentioned and right referring to the second).

One-to-many and many-to-one joins¶

It is also possible to perform joins when the keys are not unique either in the left table (many-to-one join) or in the right table (one-to-many join).

Consider the following example. One database consists of purchases by a group of people. The second database contains information about people, where each row corresponds to a person. As a person might make more than one purchase (or none), there will likely be not exactly one row per person in the purchases database.

We want to join both tables, which in this case effectively means adding the information of each person to each purchase that that person did. In other words, we want to fill the left table with information from the right one, where the keys in the left table are not unique. This is an example of a left many-to-one join.

We are able to do this, but because there are some additional cryptographic implications, it is not as straightforward as before.

First we have to create a crandas.groupby object in the following way:

purchases = cd.DataFrame({"price": [11,22,33,44,55,66,77,88,99], "buyer_id": [1,2,3,3,5,5,5,5,6]})

grouped = purchases.groupby("buyer_id")

After doing this we can join it with our people database

people = cd.DataFrame({"names": ["Alice", "Bob", "Crow", "Dave", "Eve"], "id": [1,2,3,4,5], "age": [30,32,29,45,55]})

# Note how `left_on` has the grouped object and not the name of the column
merged = cd.merge(purchases, people, left_on=grouped, right_on="id", how="left")

>>> print(merged.open())
       price  buyer_id  names  age
   11         1  Alice   30
   22         2    Bob   32
   33         3   Crow   29
   44         3   Crow   29
   55         5    Eve   55
   66         5    Eve   55
   77         5    Eve   55
   88         5    Eve   55
   99         6           0

This is one of the clearest examples of how crandas is different to pandas. Unfortunately in order to maintain the cryptographic guarantees for the data, we have to diverge how things are done in pandas. However, the fact that we can create such a table without leaking information is worth it!

Crandas supports left, right, inner, and outer joins. It also allows for different matching criteria:

One-to-one: left_on and right_on are both sets of columns from the respective tables

Many-to-one: (as demonstrated above) left_on is obtained by performing a groupby, and right_on is a set of columns

One-to-many: left_on is a set of columns, and right_on is obtained by performing a groupby.

Merge, join, and concatenate¶

Concatenating tables¶

Merging/Joining DataFrames¶

One-to-many and many-to-one joins¶

Merging/Joining `DataFrames`¶