.. _categorical:

Categorical data
################

While crandas does not offer the same typed support for categorical data as pandas, it supports some of the same features. Categorical data takes only a limited, fixed number of values and is mostly used for aggregations. This functionality is shown in :ref:`groupby`.

Quantization
==============

Sometimes, instead of working with numerical values, we might prefer to use ranges or bins. These are another kind of categorical value. Crandas can convert integer columns into such ranges with the :func:`.crandas.cut` function.

.. code:: python

    import crandas as cd

    # Create a CDataFrame for illustration purposes
    df = cd.DataFrame({
        'id': cd.Series([1,2,3,4,5,6,7,8,9,10]),
        'age': cd.Series([54,32,12,54,23,64,8,43,76,43])})

Let's start by assigning categories to different ranges of values by creating bins. In this case, we will have five categories: 0-20, 20-40, 40-60, 60-80, and 80+. To achieve this, we define four bin edges: 20, 40, 60, and 80. Each edge is the upper limit of one category and the lower limit of the next. We then assign each category a label, which should be an integer (see note below). Since we have five categories, we need five labels. Finally, we use :func:`.crandas.cut` to add a new column called ``categorized`` to the :class:`.CDataFrame`.

.. code:: python

    # Define the bin edges (giving the categories 0-20, 20-40, 40-60, 60-80, 80+)
    bins = [20, 40, 60, 80]

    # Labels must be integers (with add_inf=True, one more label than bin edges)
    labels = [1,2,3,4,5]

    # Add a new column called categorized using the 'cut' function
    df = df.assign(categorized=cd.cut(df['age'], bins=bins, labels=labels, add_inf=True, right=False))

.. note::

    - ``add_inf`` determines whether negative and positive infinity are added to the bins automatically. If it is set to ``False``, you need to add ``-np.inf`` and ``np.inf`` to the bins yourself.
    - ``right`` determines whether the right edge of each bin is inclusive (``<=`` rather than ``<`` if ``True``).

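
To make the binning rule concrete, the following plain-Python sketch reproduces the labels that the call above is expected to assign. It does not use crandas and works on unprotected values. It assumes that :func:`.crandas.cut` follows the same convention as ``pandas.cut``, where ``right=False`` makes each bin include its left edge and exclude its right edge. The helper ``cut_plain`` is a hypothetical name used only for this illustration.

```python
from bisect import bisect_left, bisect_right


def cut_plain(values, bins, labels, right=False):
    """Plain-Python illustration of binning (hypothetical helper, not part of crandas).

    Assumes -inf and +inf are implicitly added around `bins` (as with add_inf=True),
    so len(labels) must equal len(bins) + 1.
    """
    if len(labels) != len(bins) + 1:
        raise ValueError("need exactly one more label than bin edges")
    # right=False: bins are [low, high), so a value equal to an edge goes to the upper bin
    # right=True:  bins are (low, high], so a value equal to an edge goes to the lower bin
    find = bisect_left if right else bisect_right
    return [labels[find(bins, v)] for v in values]


ages = [54, 32, 12, 54, 23, 64, 8, 43, 76, 43]
bins = [20, 40, 60, 80]
labels = [1, 2, 3, 4, 5]

categorized = cut_plain(ages, bins, labels, right=False)
print(categorized)  # [3, 2, 1, 3, 2, 4, 1, 3, 4, 3]
```

Note that an age of exactly 20 would receive label 2 with ``right=False`` and label 1 with ``right=True``. This is the only difference between the two settings.
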
Why integers?
-----------------

You might be wondering why labels in the previous function are limited to integers, given that strings are more expressive and easier to understand. There are two reasons:

- First, integers give our categories an *order*. While not all categorical data is ordered, some of it is (and data that comes from quantization definitely is). In many cases, computations require this order to work correctly.
- Second, and more practically, aggregation functions that use :ref:`groupings` currently only work on integers. Categorical data and aggregation often go hand in hand, so we want to ensure maximum functionality.
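
Integer labels do not have to cost readability. One option, sketched below under the assumption that the meaning of each label is not sensitive, is to keep a plain dictionary on the client side and translate labels into names only when presenting results. The names ``AGE_GROUPS`` and ``describe`` are hypothetical and not part of crandas. The display names assume integer ages binned with ``right=False``.

```python
# Hypothetical client-side lookup table: integer label -> human-readable name
AGE_GROUPS = {
    1: "under 20",
    2: "20-39",
    3: "40-59",
    4: "60-79",
    5: "80 and over",
}


def describe(label):
    """Translate an integer category label into its display name."""
    return AGE_GROUPS[label]


# Because the labels are integers, sorting them keeps the categories in their natural order
observed = [3, 1, 5, 2]
names = [describe(label) for label in sorted(observed)]
print(names)  # ['under 20', '20-39', '40-59', '80 and over']
```
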