Categorical data
While crandas does not have a dedicated categorical type the way pandas does, it supports some of the features associated with this kind of data. Categorical data takes only a limited, fixed number of values and is mostly used for aggregations, as seen in Group by.
Quantization
Sometimes, instead of working with raw numerical values, we might prefer
to use ranges or bins. Binned values are another kind of categorical
data. Crandas can convert numeric columns into such ranges with the
cut() function.
import crandas as cd

# Create a DataFrame for illustration purposes
df = cd.DataFrame({
    'id': cd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    'age': cd.Series([54, 32, 12, 54, 23, 64, 8, 43, 76, 43]),
})
Let's start by assigning categories to different ranges of values by
creating bins. In this case, we will have five categories: 0-20, 20-40,
40-60, 60-80, and 80+. To achieve this, we will define four bin edges
that separate the categories: 20, 40, 60, and 80. We will then assign
each category a label, which should be a numeric value (i.e., an integer
or a fixed-point, see note below).
Since we have five categories, we will need five labels. Then we can use
cut() to make a new column in the DataFrame called categorized.
# Define our categories (so 0-20, 20-40, 40-60, 60-80, 80+)
bins = [20, 40, 60, 80]
# Labels must be numerics (you should always have one more label than bins)
labels = [1,2,3,4,5]
# Add a new column called categorized using the 'cut' function
df["categorized"] = cd.cut(
    df["age"], bins=bins, labels=labels, add_inf=True, right=False
)
Note
add_inf determines whether positive and negative infinity are included in the bins. If it is set to False, you need to add -np.inf and np.inf to the bins yourself. right determines whether the right edge of each bin is inclusive (<= rather than < if True).
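To make the binning rules above concrete, here is a minimal plain-Python sketch of how values map to labels under the semantics described in this section. The helper cut_plain is hypothetical and is not part of crandas; it only mirrors the described behavior of bins, labels, add_inf, and right on ordinary lists, and it assumes that with add_inf=False the caller supplies infinite outer edges, as the note suggests.

```python
from bisect import bisect_left, bisect_right
import math


def cut_plain(values, bins, labels, add_inf=True, right=False):
    """Hypothetical plain-Python illustration of the binning rules (not crandas)."""
    edges = list(bins)
    if add_inf:
        # Mirror add_inf=True: extend the edges with -inf and +inf
        edges = [-math.inf] + edges + [math.inf]
    elif edges[0] != -math.inf or edges[-1] != math.inf:
        # Assumption of this sketch: the outer edges must cover every value
        raise ValueError("with add_inf=False, bins must start at -inf and end at inf")
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per interval")
    inner = edges[1:-1]
    # right=True  -> intervals (a, b], a value equal to an edge falls in the lower bin
    # right=False -> intervals [a, b), a value equal to an edge falls in the upper bin
    locate = bisect_left if right else bisect_right
    return [labels[locate(inner, v)] for v in values]


ages = [54, 32, 12, 54, 23, 64, 8, 43, 76, 43]
bins = [20, 40, 60, 80]
labels = [1, 2, 3, 4, 5]

categorized = cut_plain(ages, bins, labels, add_inf=True, right=False)
print(categorized)  # [3, 2, 1, 3, 2, 4, 1, 3, 4, 3]
```

The only difference between right=False and right=True shows up at the edges: in this sketch an age of exactly 20 gets label 2 with right=False and label 1 with right=True.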
Why numerics?
You might be wondering why we limit labels to numerics in the previous function. After all, strings are more expressive and easier to understand. The reason is that numerics give our categories an order. While not all categorical data is ordered, some of it is (and data produced by quantization definitely is). In many cases, computations may require this order to work correctly.
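The ordering argument can be illustrated in plain Python, without crandas. In the sketch below, the string labels are hypothetical names chosen for this example; the numeric labels and ages are the ones used earlier in this section.

```python
# Hypothetical string labels for the five age categories, in age order
string_labels = ["child", "young adult", "adult", "senior", "elderly"]
numeric_labels = [1, 2, 3, 4, 5]

# Sorting strings is alphabetical, which loses the age order
sorted_strings = sorted(string_labels)

# Numeric labels keep the order of the underlying ranges
sorted_numerics = sorted(numeric_labels)

# With ordered labels, a range query becomes a simple comparison:
# "label >= 3" selects everyone aged 40 or older
ages = [54, 32, 12, 54, 23, 64, 8, 43, 76, 43]
categorized = [3, 2, 1, 3, 2, 4, 1, 3, 4, 3]
count_by_label = sum(1 for label in categorized if label >= 3)
count_by_age = sum(1 for age in ages if age >= 40)

print(sorted_strings)
print(count_by_label, count_by_age)
```

Here the alphabetical sort puts "adult" before "child", whereas the numeric labels sort in the same order as the age ranges, so comparisons on the label agree with comparisons on the original values.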