Group by
Using the groupby()
functionality, we can partition the rows of a table based on the value
of one of the columns, and perform subsequent operations on the
resulting parts. The grouping functionality is similar to the pandas
library, and it is strongly related to the SQL GROUP BY clause. The
process involves splitting the data into groups, applying an aggregation
function to each group, and then combining the results.
To group data in crandas, you can use the
DataFrame.groupby() method. This will
create a DataFrameGroupBy object, which
we call a grouping. This object determines the way the original table
must be transformed to group based on a column.
Aggregation operations
Here is a list of aggregation operations that can be performed on
grouped data, through a DataFrameGroupBy
object.
crandas.groupby.sum: Compute the sum of each group (numeric columns only).crandas.groupby.max: Find the maximum value of each group (numeric columns only).crandas.groupby.min: Find the minimum value of each group (numeric columns only).crandas.groupby.mean: Compute the mean of each group (numeric columns only).crandas.groupby.any: Return any value from the group.
Below is an example that demonstrates how to use these operations with crandas:
import numpy as np
import pandas as pd
import crandas as cd
# Create a dataframe using pandas
ipl_data = {
"Team": ["Riders","Riders","Devils","Devils","Kings","kings",
"Kings","Kings","Riders","Royals","Royals","Riders",
],
"Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
"Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
"Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
}
df = pd.DataFrame(ipl_data)
After creating the dataframe, we are able to use
upload_pandas_dataframe() and
then group by Year to execute different operations.
# Upload the pandas dataframe to the engine then group by year and
# perform aggregation operations
cdf = cd.upload_pandas_dataframe(df)
# This is our grouping object
cdgrouped = cdf.groupby("Year")
cdsum = cdgrouped["Points"].agg(cd.groupby.sum)
cdmax = cdgrouped["Points"].agg(cd.groupby.max)
cdmin = cdgrouped["Points"].agg(cd.groupby.min)
cdmean = cdgrouped["Points"].agg(cd.groupby.mean)
# Alternative syntax
cdsum = cdgrouped["Points"].sum()
cdmax = cdgrouped["Points"].max()
cdmin = cdgrouped["Points"].min()
cdmean = cdgrouped["Points"].mean()
Note
The crandas.groupby.any aggregation operation returns any column value from the group. There are no guarantees which value this is, or whether this value is the same between runs. Thus functionality is mainly useful if it is already known beforehand that all column values of the group will be the same (for example if the column is obtained via a join with another table). In such cases, crandas.groupby.any, crandas.groupby.min, and crandas.groupby.max will all return the same result, but crandas.groupby.any is much more efficient than the other two.
We can also look at the size of each group that we create using the
.groupby() method. Below we
create a pandas DataFrame which we upload to the engine. We then group
on column a (leading to 3 groups - 1, 2 and 3). We can then use the
.size() to compute the
size of each group.
pa = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 2, 3, 3]})
ca = cd.upload_pandas_dataframe(pa)
cgrouping = ca.groupby("a")
csizes = cgrouping.size()
Many-to-One Joins
One of the primary uses of the DataFrameGroupBy
objects in crandas is to allow
many-to-one joins.
That is, joining two tables on a column whose values are repeated in the
first table and unique in the second. Grouping the first table allows us
to connect both tables, which is why in that case, we input the grouping
instead of a column as a key.
Reusing a grouping
Occasionally, after performing a grouping you will modify the table
that such a grouping belonged to, e.g. computing new columns or
performing a join. After doing that, the old
DataFrameGroupBy will no longer work.
tab = cd.DataFrame({"a": [1, 2, 2, 4, 5, 5]}, auto_bounds=True)
new_data = cd.DataFrame({"b": [0, 6, 6, 8, 9, 9]}, auto_bounds=True)
# make grouping
grouping = tab.groupby("a")
# edit table tab
a = cd.concat([tab, new_data], axis=1)
# this does not work, since the grouping was made for the old tab
grouping["b"].agg(cd.groupby.sum)
However, instead of fully recomputing the grouping, which might take time. We can simply convert the previous one:
While the steps to create this new grouping might seem the same, doing it this way is more efficient than computing it from scratch.
For this strategy to work, it is important that the order of the rows in the original table is not modified. If all of therows are not in their original place, we would need to create a new grouping from the table.
tab = tab.shuffle()
# This will not work
newer_grouping = tab.groupby(new_grouping)
# But this will
newest_grouping = tab.groupby("a")
Group by on multiple columns
It is also possible to perform a group by multiple columns. Instead of
passing the name of a single column as an argument to
.groupby(), a list of column
names can be used.