Group by: split-apply-combine¶
Using the CDataFrame.groupby()
functionality, we can partition the rows of a table based on the value of one of the columns, and perform subsequent operations on the resulting parts.
The grouping functionality is similar to the Pandas library, and it is strongly related to the SQL GROUP BY
clause.
The process involves splitting the data into groups, applying an aggregation function to each group, and then combining the results.
Note
Currently, only integer columns can be the base of a grouping operation. See Tips and tricks to see how to work with this when you are working with string columns.
To group data in crandas, you can use the CDataFrame.groupby()
method. This will create a crandas.groupby
object, which we call a grouping. This object determines the way the original table must be transformed to group based on a column.
Aggregation operations¶
Here is a list of aggregation operations that can be performed on grouped data, through a crandas.groupby
object.
crandas.groupby.sum
: Compute the sum of each group.crandas.groupby.max
: Find the maximum value of each group.crandas.groupby.min
: Find the minimum value of each group.
Below is an example that demonstrates how to use these operations with crandas:
import numpy as np
import pandas as pd
import crandas as cd
# Create a dataframe using pandas
ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
"Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
"Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
"Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
After creating the dataframe, we are able to use crandas.upload_pandas_dataframe()
and then group by Year
to execute different operations.
# Upload the pandas dataframe to the VDL (CDataFrame), then group by year and
# perform aggregation operations
cdf = cd.upload_pandas_dataframe(df)
# This is our grouping object
cdgrouped = cdf.groupby("Year")
cdsum = cdgrouped["Points"].agg(cd.groupby.sum)
cdmax = cdgrouped["Points"].agg(cd.groupby.max)
cdmin = cdgrouped["Points"].agg(cd.groupby.min)
Note
For reference, this is slightly different to how one would use the GroupBy
method in pandas. The difference can be seen below:
df = pd.DataFrame(ipl_data)
pdgrouped = df.groupby("Year")
pdsum = pdgrouped["Points"].agg(np.sum)
pdmax = pdgrouped["Points"].agg(np.max)
pdmin = pdgrouped["Points"].agg(np.min)
We can also look at the size of each group that we create using the CDataFrame.groupby()
method.
Below we create a pandas dataframe which we upload to the VDL. We then group on column a
(leading to 3 groups - 1, 2 and 3).
We can then use the CDataFrameGroupBy.size()
to compute the size of each group.
pa = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 2, 3, 3]})
ca = cd.upload_pandas_dataframe(pa)
cgrouping = ca.groupby("a")
csizes = cgrouping.size()
>>> csizes
a
1 3
2 4
3 2
Many-to-One Joins¶
One of the primary uses of the crandas.groupby
objects in crandas is to allow many-to-one joins. That is, joining two tables on a column whose values are repeated in the first table and unique in the second. Grouping the first table allows us to connect both tables, which is why in that case, we input the grouping instead of a column as a key.
Reusing a grouping¶
Added in version 1.5: Re-using a grouping.
Occassionally, after performing a grouping you will modify the table that such a grouping belonged to, e.g. computing new columns or performing a join. After doing that, the old crandas.groupby
will no longer work.
tab = cd.DataFrame({"a": [1, 2, 2, 4, 5, 5]}, auto_bounds=True)
new_data = cd.DataFrame({"a": [2, 3, 4, 5], "b": [6, 7, 8, 9]}, auto_bounds=True)
# make grouping
grouping = tab.groupby("a")
# edit table tab
tab = cd.merge(tab, new_data, how="left", left_on=grouping, right_on="a")
# this does not work, since the grouping was made for the old tab
grouping["b"].agg(cd.groupby.sum)
However, instead of fully recomputing the grouping, which might take time. We can simply convert the previous one:
new_grouping = tab.groupby(grouping)
# Now this will work
new_grouping["b"].agg(cd.groupby.sum)
While the steps to create this new grouping might seem the same, doing it this way is more efficient than computing it from scratch.
For this strategy to work, it is important that the order of the rows in the original table is not modified. If all of therows are not in their original place, we would need to create a new grouping from the table.
tab = tab.shuffle()
# This will not work
newer_grouping = a.groupby(new_grouping)
# But this will
newest_grouping = a.groupby("a")