Merge, join, and concatenate¶
Crandas is a tool that allows for secure data- collaboration.
It allows combining data from multiple sources to do computations.
This section will show you exactly that.
Here you will learn how to can combine CDataFrames
in different ways.
There are multiple ways to combine tables in crandas, and here we will see them all:
Concatenating tables¶
The simplest way to concatenate tables is by just stacking them together, either on top of each other or side by side.
In order to do this we must be sure that the tables have the same structure, otherwise we will not be able to concatenate them.
In order to do this you can use crandas.concat()
.
When you have multiple tables with the same structure that you want to combine (like when you are working with multiple data sources that do the same thing, like for example hospitals), that is when you use vertical concatenation (axis=0
).
Warning
The tables you want to join must have the same number of columns, the same column names and the same types. It is also important to know that data is formatted in the same way before doing this.
import crandas as cd
# Creating sample CDataFrames
a = cd.DataFrame({"id": [2, 1, 3, 4], "id2": [5, 1, 4, 0]})
b = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]})
c = cd.DataFrame({"id3": [3, 0, 1, 2], "id": [54, 1, 42, 1]})
d = cd.DataFrame({"id2": [3, 0, 1, 2], "id": [54, 1, 42, 1]})
e = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]})
# Simple vertical concatenation
t1 = cd.concat([a, b])
# Vertical concatenation, equivalent to the above
t2 = cd.concat([a, b], axis=0)
# Concatenate only common columns, ignore index
t3 = cd.concat([a, c], join="inner")
# Concatenate CDataFrames with columns in a different order
t4 = cd.concat([a, d])
# Concatenate multiple CDataFrames at once
t4 = cd.concat([a, b, e])
It is also possible to concatenate tables horizontally, that is, next to each other. You can do this by using axis=1
, however this is highly discouraged when working with private data, as it is impossible to check whether the two tables “match”. Thankfully, there are smarter ways that we can do that.
Merging/Joining CDataFrames
¶
In order to properly concatenate the rows of two tables, we need to have a key (or multiple) that identifies which rows should be joined together. When we have this, we can join the tables together using crandas.merge()
.
The main way we can join two tables is as an inner join. This means that given a key, only rows where that key appears in both tables will appear in the joined table. This type of join is very useful as this way we avoid carrying missing values when one value appears in one table and not the other.
left_df = cd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Eve"]})
right_df = cd.DataFrame({"id": [4, 2, 1], "salary": [1234, 1111, 98765], "name": ["staple", "horse", "correct"]})
# Inner join
merged = cd.merge(left_df, right_df, on="id")
>>> print(merged.open())
id name_x salary name_y
0 1 Alice 98765 correct
1 2 Bob 1111 horse
The new table combined the two tables on id
regardless of the original order and did not include the values that were not there. Additionally, it added suffixes _x
and _y
to the columns that had a repeated name.
Note that the name of the key column names do not need to match, in that case you can use left_on
and right_on
instead of on
. Additionally, you can perform a left, right, or outer join, and/or a join using multiple keys.
left_df = cd.DataFrame(
{
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Eve"],
"good/evil": [1, 1, 0],
"color": ["green", "black", "green"],
}
)
right_df = cd.DataFrame(
{
"key": [4, 2, 1],
"salary": [1234, 1111, 98765],
"name": ["staple", "horse", "correct"],
"good/evil": [1, 1, 0],
"color": ["green", "black", "green"],
}
)
# Outer join, joining on different named keys
merged_outer = cd.merge(left_df, right_df, how="outer", left_on="id", right_on="key")
# Inner join over multiple keys
merged_mult_keys = cd.merge(left_df, right_df, on=["good/evil", "color"])
Note
The above technique performs a one-to-one join, meaning that the keys need to be unique in both tables. See below for one-to-many and many-to-one joins.
Sometimes, you may wish to join two datasets that have different column names that you want to join on. In this case, you can specify left_on
and right_on
(left referring to the first dataset mentioned and right referring to the second).
One-to-many and many-to-one joins¶
It is also possible to perform joins when the keys are not unique either in the left table (many-to-one join) or in the right table (one-to-many join).
Consider the following example. One database consists of purchases by a group of people. The second database contains information about people, where each row corresponds to a person. As a person might make more than one purchase (or none), there will likely be not exactly one row per person in the purchases database.
We want to join both tables, which in this case effectively means adding the information of each person to each purchase that that person did. In other words, we want to fill the left table with information from the right one, where the keys in the left table are not unique. This is an example of a left many-to-one join.
We are able to do this, but because there are some additional cryptographic implications, it is not as straightforward as before.
First we have to create a crandas.groupby
object in the following way:
purchases = cd.DataFrame({"price": [11,22,33,44,55,66,77,88,99], "buyer_id": [1,2,3,3,5,5,5,5,6]})
grouped = purchases.groupby("buyer_id")
After doing this we can join it with our people database
people = cd.DataFrame({"names": ["Alice", "Bob", "Crow", "Dave", "Eve"], "id": [1,2,3,4,5], "age": [30,32,29,45,55]})
# Note how `left_on` has the grouped object and not the name of the column
merged = cd.merge(purchases, people, left_on=grouped, right_on="id", how="left")
>>> print(merged.open())
price buyer_id names age
0 11 1 Alice 30
1 22 2 Bob 32
2 33 3 Crow 29
3 44 3 Crow 29
4 55 5 Eve 55
5 66 5 Eve 55
6 77 5 Eve 55
7 88 5 Eve 55
8 99 6 0
This is one of the clearest examples of how crandas is different to pandas. Unfortunately in order to maintain the cryptographic guarantees for the data, we have to diverge how things are done in pandas. However, the fact that we can create such a table without leaking information is worth it!
Crandas supports left, right, inner, and outer joins, either as one-to-one joins (left_on and right_on are both sets of columns from the respective tables); as many-to-one joins as demonstrated above (left_on is obtained by performing a groupby, and right_on is a set of columns); or as one-to-many joins (left_on is a set of columns, and right_on is obtained by performing a groupby).