Categorical data
While crandas does not offer the same typed support for categorical data as pandas, it does support some of the same functionality. Categorical data takes only a limited, fixed number of values and is mostly used for aggregations. This functionality is shown in Group by: split-apply-combine.
Quantization
Sometimes, instead of working with exact numerical values, we prefer to work with ranges or bins. Binned values are another kind of categorical data. Crandas can convert integer columns into such ranges with the crandas.cut() function.
import crandas as cd

# Create a CDataFrame for illustration purposes
df = cd.DataFrame({
    'id': cd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    'age': cd.Series([54, 32, 12, 54, 23, 64, 8, 43, 76, 43]),
})
Let’s start by assigning categories to different ranges of values by creating bins. In this case, we will have five categories: 0-20, 20-40, 40-60, 60-80, and 80+. To achieve this, we define four bin edges: 20, 40, 60, and 80. We then assign each category a label, which must be an integer (see Why integers? below).
Since we have five categories, we need five labels. We can then use crandas.cut() to create a new column in the CDataFrame called categorized.
# Define our categories (so 0-20, 20-40, 40-60, 60-80, 80+)
bins = [20, 40, 60, 80]
# Labels must be integers (you should always have one more label than bins)
labels = [1,2,3,4,5]
# Add a new column called categorized using the 'cut' function
df = df.assign(categorized=cd.cut(df['age'], bins=bins, labels=labels, add_inf=True, right=False))
Note
add_inf refers to whether or not positive/negative infinity are included in the bins. If this is set to False, you need to add -np.inf and np.inf to the bins yourself.
right refers to whether or not the right edge of each bin is inclusive (<= rather than < if True).
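To make the effect of add_inf and right concrete, here is a minimal plain-Python sketch of the bin assignment described above. It does not use crandas and operates on ordinary, unprotected values. The helper name assign_labels is hypothetical, and the sketch assumes that crandas.cut() follows the interval convention described in the note.

```python
from bisect import bisect_left, bisect_right

def assign_labels(values, edges, labels, right=False):
    """Map each value to a label, treating `edges` as interior bin edges.

    Mirrors add_inf=True: the outermost bins extend to -inf and +inf,
    so there must be exactly one more label than edges.
    (Hypothetical helper for illustration; not part of crandas.)
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    # right=False: bins are [low, high), so a value equal to an edge moves up.
    # right=True:  bins are (low, high], so a value equal to an edge stays down.
    locate = bisect_left if right else bisect_right
    return [labels[locate(edges, v)] for v in values]

ages = [54, 32, 12, 54, 23, 64, 8, 43, 76, 43]
edges = [20, 40, 60, 80]
labels = [1, 2, 3, 4, 5]

categorized = assign_labels(ages, edges, labels)
print(categorized)  # [3, 2, 1, 3, 2, 4, 1, 3, 4, 3]

# A value that sits exactly on an edge shows the difference between the modes.
print(assign_labels([20], edges, labels, right=False))  # [2]
print(assign_labels([20], edges, labels, right=True))   # [1]
```

None of the example ages reaches 80, so label 5 is unused here. It is still required because the last bin extends to infinity.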
Why integers?
You might be wondering why we limit labels to integers in the previous function, given that strings are more expressive and easier to understand. There are two reasons:
The first is that using integers gives our categories an order. While not all categorical data is ordered, some of it is (and data that comes from quantization definitely is). In many cases, computations require this order to work correctly.
The second reason is more practical: currently, aggregation functions that use groupings only work on integers. Categorical data and aggregation often go hand in hand, so we want to ensure maximum functionality.
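If readable names are still wanted for reporting, one option is to keep a mapping from integer labels to names outside the data and apply it only to aggregated results. The following plain-Python sketch illustrates the idea on ordinary values. It does not use crandas, and the label_names mapping is a hypothetical example.

```python
from collections import Counter

# Hypothetical display names, kept separately from the data.
label_names = {1: "0-20", 2: "20-40", 3: "40-60", 4: "60-80", 5: "80+"}

# Integer labels for the example ages, binned with edges 20, 40, 60, 80.
categorized = [3, 2, 1, 3, 2, 4, 1, 3, 4, 3]

# Count how many values fall into each category.
counts = Counter(categorized)

# Sorting by integer label keeps the ranges in their natural order,
# which sorting by the string names would not guarantee.
report = [(label_names[label], counts[label]) for label in sorted(counts)]
print(report)  # [('0-20', 2), ('20-40', 2), ('40-60', 4), ('60-80', 2)]
```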