Categorical data

While crandas does not offer the same typed support for categorical data as pandas, it supports some of the same functionality. Categorical data takes only a limited, fixed number of values and is mostly used for aggregations. This functionality is shown in Group by: split-apply-combine.

Quantization

Sometimes, instead of working with numerical values, we might prefer to work with ranges or bins. Binned values are another kind of categorical data. Crandas can convert numeric columns into such ranges with the crandas.cut() function.

import crandas as cd

# Create a CDataFrame for illustration purposes
df = cd.DataFrame({
    'id': cd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    'age': cd.Series([54, 32, 12, 54, 23, 64, 8, 43, 76, 43]),
})

Let’s start by assigning categories to different ranges of values by creating bins. In this case, we will have five categories: 0-20, 20-40, 40-60, 60-80, and 80+. To achieve this, we will define four bin edges (20, 40, 60, and 80), which act as the upper limits of the first four categories; the fifth category covers everything from 80 upward. We will then assign each category a label, which must be a numeric value (i.e., an integer or a fixed-point number, see note below).

Since we have five categories, we will need five labels. Then we can use crandas.cut() to make a new column in the CDataFrame called categorized.

# Define our categories (so 0-20, 20-40, 40-60, 60-80, 80+)
bins = [20, 40, 60, 80]

# Labels must be numerics (you should always have one more label than bins)
labels = [1,2,3,4,5]

# Add a new column called categorized using the 'cut' function
df = df.assign(categorized=cd.cut(df['age'], bins=bins, labels=labels, add_inf=True, right=False))

Note

  • add_inf controls whether negative and positive infinity are added to the bins as the outermost edges. If it is set to False, you need to add -np.inf and np.inf to the bins yourself.

  • right controls whether the right edge of each bin is inclusive (<= rather than < if True).
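To make the edge handling concrete, here is a minimal local sketch in plain Python. It is not how crandas implements cut() and it works on unencrypted values; cut_sketch is a hypothetical helper that only mirrors the rules described in the note, assuming the outermost bins extend to infinity (as with add_inf=True).

```python
from bisect import bisect_left, bisect_right


def cut_sketch(values, bins, labels, right=False):
    """Hypothetical helper: assign a label to each value, assuming open-ended outer bins."""
    if len(labels) != len(bins) + 1:
        raise ValueError("need exactly one more label than bin edges")
    # right=False: bins are [low, high), so a value equal to an edge falls in the upper bin.
    # right=True: bins are (low, high], so a value equal to an edge falls in the lower bin.
    locate = bisect_left if right else bisect_right
    return [labels[locate(bins, v)] for v in values]


ages = [54, 32, 12, 54, 23, 64, 8, 43, 76, 43]
bins = [20, 40, 60, 80]
labels = [1, 2, 3, 4, 5]

categorized = cut_sketch(ages, bins, labels, right=False)
print(categorized)  # [3, 2, 1, 3, 2, 4, 1, 3, 4, 3]

# Values that sit exactly on an edge show the effect of `right`
print(cut_sketch([20, 80], bins, labels, right=False))  # [2, 5]
print(cut_sketch([20, 80], bins, labels, right=True))   # [1, 4]
```

With right=False, an age of exactly 20 lands in the 20-40 category; with right=True it would land in 0-20.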

Why numerics?

You might be wondering why the labels in the previous function are limited to numerics, given that strings are more expressive and easier to understand. Using numerics allows us to impose an order on our categories. While not all categorical data is ordered, some of it is, and data that comes from quantization definitely is. In many cases, computations require this order to work correctly.
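The point about ordering can be illustrated with a short plain-Python sketch. The string labels below are hypothetical and the values are ordinary unencrypted lists, so this only shows the general idea rather than anything specific to crandas.

```python
# Hypothetical string labels for three ordered categories
string_labels = ["low", "medium", "high"]

# Strings sort alphabetically, which loses the intended order
print(sorted(string_labels))  # ['high', 'low', 'medium']

# Numeric labels keep the order of the underlying ranges
numeric_labels = [1, 2, 3]
print(sorted(numeric_labels))  # [1, 2, 3]

# They also support range comparisons, e.g. "category 3 or higher" (age 40 and up)
categorized = [3, 2, 1, 3, 2, 4, 1, 3, 4, 3]
forty_and_over = sum(1 for c in categorized if c >= 3)
print(forty_and_over)  # 6
```

A comparison such as c >= 3 has no natural equivalent for labels like "low" or "high" without an extra mapping back to numbers.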

In the next section, we will learn how to combine multiple tables. This is one of the most important features for data sharing, as it allows us to consolidate data from multiple sources into one.