.. _groupby: Group by: split-apply-combine ############################# .. Go through and add the nuances (i.e. currently doesn't work with strings as data type, mention using filtering for mean, etc.) Using the :meth:`.CDataFrame.groupby` functionality, we can partition the rows of a table based on the value of one of the columns, and perform subsequent operations on the resulting parts. The grouping functionality is similar to the pandas library, and it is strongly related to the SQL ``GROUP BY`` clause. The process involves splitting the data into groups, applying an aggregation function to each group, and then combining the results. To group data in crandas, you can use the :meth:`.CDataFrame.groupby` method. This will create a :class:`crandas.groupby` object, which we call a *grouping*. This object determines the way the original table must be transformed to group based on a column. .. warning:: Currently, a maximum of around 100 000 unique values are supported. Above that, the groupby will fail and give an error message. Note that this is the number of *unique* values. The number of rows can be significantly higher as long as there are less than 100 000 different values in the groupby column(s). Aggregation operations ------------------------ Here is a list of aggregation operations that can be performed on grouped data, through a :class:`crandas.groupby` object. - :data:`.crandas.groupby.sum`: Compute the sum of each group (integer columns only). - :data:`.crandas.groupby.max`: Find the maximum value of each group (integer columns only). - :data:`.crandas.groupby.min`: Find the minimum value of each group (integer columns only). - :data:`.crandas.groupby.any`: Return any value from the group. Below is an example that demonstrates how to use these operations with crandas: .. code:: python import numpy as np import pandas as pd import crandas as cd # Create a dataframe using pandas ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"], "Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2], "Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017], "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]} df = pd.DataFrame(ipl_data) After creating the dataframe, we are able to use :func:`.crandas.upload_pandas_dataframe` and then group by ``Year`` to execute different operations. .. code:: python # Upload the pandas dataframe to the engine (CDataFrame), then group by year and # perform aggregation operations cdf = cd.upload_pandas_dataframe(df) # This is our grouping object cdgrouped = cdf.groupby("Year") cdsum = cdgrouped["Points"].agg(cd.groupby.sum) cdmax = cdgrouped["Points"].agg(cd.groupby.max) cdmin = cdgrouped["Points"].agg(cd.groupby.min) .. note:: For reference, this is slightly different to how one would use the ``GroupBy`` method in pandas. The difference can be seen below: .. code:: python df = pd.DataFrame(ipl_data) pdgrouped = df.groupby("Year") pdsum = pdgrouped["Points"].agg(np.sum) pdmax = pdgrouped["Points"].agg(np.max) pdmin = pdgrouped["Points"].agg(np.min) .. note:: The :data:`.crandas.groupby.any` aggregation operation returns any column value from the group. There are no guarantees which value this is, or whether this value is the same between runs. Thus functionality is mainly useful if it is already known beforehand that all column values of the group will be the same (for example if the column is obtained via a join with another table). In such cases, :data:`.crandas.groupby.any`, :data:`.crandas.groupby.min`, and :data:`.crandas.groupby.max` will all return the same result, but :data:`.crandas.groupby.any` is much more efficient than the other two. We can also look at the size of each group that we create using the :meth:`.CDataFrame.groupby` method. Below we create a pandas DataFrame which we upload to the engine. We then group on column ``a`` (leading to 3 groups - 1, 2 and 3). We can then use the :meth:`.CDataFrameGroupBy.size` to compute the size of each group. .. code:: python pa = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 2, 3, 3]}) ca = cd.upload_pandas_dataframe(pa) cgrouping = ca.groupby("a") csizes = cgrouping.size() >>> csizes a 1 3 2 4 3 2 Many-to-One Joins ------------------- One of the primary uses of the :class:`crandas.groupby` objects in crandas is to allow :ref:`many-to-one joins`. That is, joining two tables on a column whose values are repeated in the first table and unique in the second. Grouping the first table allows us to connect both tables, which is why in that case, we input the grouping instead of a column as a key. Reusing a grouping ------------------- .. versionadded:: 1.5 Re-using a grouping. Occassionally, after performing a grouping you will modify the table that such a grouping belonged to, e.g. computing new columns or performing a join. After doing that, the old :class:`crandas.groupby` will no longer work. .. code:: python tab = cd.DataFrame({"a": [1, 2, 2, 4, 5, 5]}, auto_bounds=True) new_data = cd.DataFrame({"b": [0, 6, 6, 8, 9, 9]}, auto_bounds=True) # make grouping grouping = tab.groupby("a") # edit table tab a = cd.concat([tab, new_data], axis=1) # this does not work, since the grouping was made for the old tab grouping["b"].agg(cd.groupby.sum) However, instead of fully recomputing the grouping, which might take time. We can simply convert the previous one: .. code:: python new_grouping = tab.groupby(grouping) # Now this will work new_grouping["b"].agg(cd.groupby.sum) While the steps to create this new grouping might seem the same, doing it this way is more efficient than computing it from scratch. For this strategy to work, it is important that the order of the rows in the original table is not modified. If all of therows are not in their original place, we would need to create a new grouping from the table. .. code:: python tab = tab.shuffle() # This will not work newer_grouping = tab.groupby(new_grouping) # But this will newest_grouping = tab.groupby("a") Group by on multiple columns ---------------------------- .. versionadded:: 1.9 Group by on multiple columns and columns of any non-nullable type. It is also possible to perform a group by multiple columns. Instead of passing the name of a single column as an argument to :meth:`.CDataFrame.groupby`, a list of column names can be used. The referenced columns can be of any non-nullable type. .. code:: python tab = cd.DataFrame({"a": [1, 2, 2, 2, 3, 3], "b": ["x", "x", "x", "y", "y", "y"]}, auto_bounds=True) # Perform group by on multiple columns, which can be of any non-nullable type grouping = tab.groupby(["a", "b"]) # The grouping object behaves the same as for a single column csizes = cgrouping.size() >>> csizes a b 0 2 x 2 1 1 x 1 2 2 y 1 3 3 y 2