Basic table operations
This section will cover essential basic functionality in crandas, including inspecting, assigning new columns, computing with columns, boolean logic, and manipulating column names.
Inspecting DataFrames
While you cannot normally access the data stored in a crandas
DataFrame you can use the
.describe() method to inspect a
DataFrame to get summary statistics
of the numeric data.
>>>df.describe()
A B
0 type integer integer
1 count 5 5
2 mean 3.0 -1.0
3 std 1.581139 0.0
4 min 1 -1
5 max 5 -1
See Working with numeric data for details
Assigning new columns
To create a new column or update an existing one, simply assign the
desired value to the target column. Alternatively, the method
.assign() can be used.
df = cd.DataFrame({"col1":[1,2,3,4,5], "col2": [6,7,8,9,10]})
# Create a new column called 'col3' which is equal to 'col1' - 1
# Using direct assignment
df["col3"] = df["col1"] - 1
# Create a 2 new columns: "col4" which is equal to 'col1' + 1
# and "col5" which is equal to "col2" - 1
# Using the assign method, which can assign multiple columns at once
df = df.assign(col4=df["col1"] + 1, col5=lambda x: x.col2 - 1)
>>> print(df.open())
col1 col2 col3 col4 col5
0 1 6 0 2 5
1 2 7 1 3 6
2 3 8 2 4 7
3 4 9 3 5 8
4 5 10 4 6 9
Computing with columns
In the previous section, we created a new column by applying an operation to an existing
column: df["col3"] = df["col1"] - 1.
Here we did an arithmetic operation between a column and a constant but these operations
can also be between two columns: df["col1"]+df["col2"].
In general, they can include any number of columns and constants, like so: 3*df["col1"]+df["col2"]/2.
The columns do not even have to be from the same table, as long as they have the same length.
Warning
Only compute operations between two distinct tables if you are sure that the rows are in the same order. It is ofter better to first join tables before computing operations over the columns.
When doing operations of this sort, we either want to assign them to a column of an
existing table, like in the section before, or if we want to treat the new column
as on its own, using .as_table() to convert
it to a table.
>>> (df["col1"] + 3*df["col2"]).as_table(column_name="operation")
Handle: 43E2C3ECEC05859ECE4A67202E9BDA6ADA9006CB906317852586DDE39D0507B4 (design mode)
operation
ctype int
Size: 5 rows x 1 columns (contents secret-shared; data size: 40 bytes)
If we only want the result of the operation we can simply open it
Arithmetic operations are not the only ones that can be applied to a column. These operations depend on the type of the data and will be discussed in the chapters starting with Working with numeric data. A common operation that we will want to do is comparison.
Given a column, we can check whether its values are equal, greater (or equal) than or less (or equal) than a given value.
df = cd.DataFrame({
"id": [1,2,3,4,5],
"is_member": [0,1,1,0,1],
"height": [175,162,151,160,180]
}
)
#We check who has id=2
id_is2 = df["id"] == 2
#We check which rows hold the information of someone who is 160cm or taller
at_least160 = df["height"] >= 160
Of course, we probably want to do more than just creating this column. We can also filter based on this value (Find out more about filtering in Selecting data):
df = cd.DataFrame({
"id": [1,2,3,4,5],
"is_member": [0,1,1,0,1],
"height": [175,162,151,160,180]
}
)
# Filter the df such that only members are included (using ==)
member = df[df["is_member"] == 1]
# Print the mean member height
print(member['height'].mean())
# For not_member we can change '==' to '!='
not_member = df[df['is_member'] != 1]
# Print the mean not_member height
print(not_member['height'].mean())
Note
You can also use boolean operations over columns, like
(df['is_member'] > 3) & (df['height'] < 160). Crandas supports
and &, or | and xor ^.
Sometimes there might be more complex operations that we wish to make. For example, maybe we want to determine which values in a column are higher than the mean. If we are willing to open the mean, this is very straightforward
However, maybe we do not want to open the mean, as it should be kept secret.
In these cases we use mode="regular" to keep the value closed.
secret_mean = df["height"].mean(mode="regular")
df["above_secret_mean"] = df["height"] > secret_mean
>>> df.open()
id is_member height above_mean above_secret_mean
0 1 0 175 True True
1 2 1 162 False False
2 3 1 151 False False
3 4 0 160 False False
4 5 1 180 True True
Conditionals over columns
Once you are able to check whether a column fulfills a certain property,
you might want to create a new column with a value based on that.
Crandas has multiple ways to make this happen. We can use the
.if_else() method to create
simple conditionals.
df = cd.DataFrame({
"id": [1,2,3,4,5],
"is_member": [0,1,1,0,1],
"height": [175,162,151,160,180]
}
)
# Create a column that shows the cms above 160 in the person's height or -1 if they are shorter
df["cms_above_160"] = (df["height"] >= 160).if_else(df["height"] - 160, -1)
We can use .if_else() to make
categorical columns out of numerical ones, by nesting the function:
# Creates 3 categories for height:
# 0 for below 160, 1 for between 160 and 170 and 2 for above 170
df["height_cats"] = (df["height"] >= 170).if_else(2, (df["height"] >= 160).if_else(1, 0))
Tip
Crandas provides easier ways to create categorical columns, as seen in Categorical data.
We can also use the .where()
method when we want to only keep the value when a condition is met and
leave a null value instead.
# This command returns a column where only the heights of members are shown
# Non-member heights are turned to null values
df["height"].where(df["is_member"]==1)
# If we add an else value, we can fill that in whenever the condition is not met
# Here, the height of non-members is set to 175
df["height"].where(df["is_member"]==1,175)
If we want to use multiple nested conditions, this syntax will quickly
become hard to parse. Crandas also provides when().then() syntax
similar to polars. This syntax can be used to more easily represent
multiple conditionals and values using the function
when():
df["is_it_three"] = (
cd.when(df["height"] > 175)
.then("bigger")
.when(df["height"] < 175)
.then("smaller")
.when(df["height"] == 175)
.then("perfect")
)
# The function `cd.when()`` can be used to add multiple conditions
df["val"] = (
cd.when(
cd.col("member") == 1, # (1)
cd.col("height") > 175, # (2)
)
.then("tall member")
.otherwise(-1) # (3)
)
cd.col()allows us to refer to the column in the context of the table.- We can add multiple conditions that refer to different columns.
.otherwise()gives us the value in the case no condition is met.
The .if_else() method can also
be applied to a DataFrame as a whole.
This is useful in situations where multiple columns are determined under
the same condition. For example, consider a table of food products where
several columns need to be set based on the food's type. Instead of
applying .if_else() on the
respective columns, it is also possible make a
DataFrame for each type of food, and
select the correct instance using .if_else():
cdf = cd.DataFrame({"product": ["apple", "crisps", "beer"],
"type": ["fruit", "snack", "alcoholic drink"]})
cdf_fruit = cdf.assign(healthy=1, vat_tariff=0.09) # values for case 1
cdf_snack = cdf.assign(healthy=0, vat_tariff=0.09) # values for case 2
cdf_drink = cdf.assign(healthy=0, vat_tariff=0.21) # values for case 3
# Select cdf_fruit, cdf_snack, or cdf_drink based on cdf["type"]
>>> (cdf["type"]=="fruit").if_else(cdf_fruit,
(cdf["type"]=="snack").if_else(cdf_snack, cdf_drink)).open()
product type healthy vat_tariff
0 apple fruit 1 0.09
1 crisps snack 0 0.09
2 beer alcoholic drink 0 0.21
Manipulating column names
Crandas provides methods to rename, add suffixes, or add prefixes to column names.
The .rename() method allows you to
rename specific columns by providing a dictionary, where the keys are
the current column names, and the values are the new column names.
df = cd.DataFrame({"col1": [1,2,3], "col2": [4,5,6]})
df = df.rename(columns={"col1": "A", "col2": "B"})
>>> print(list(df.columns))
['A','B']
If instead you would like to add a prefix or a suffix to all column
names in a DataFrame then you should
use either of the following methods:
df = cd.DataFrame({"col1": [1,2,3], "col2": [4,5,6]})
df = df.add_prefix("_1")
df = df.add_suffix('_2')
>>> print(list(df.columns))
['_1col1_2','_1col2_2']
Tip
Adding prefixes or suffixes to your column names can be useful when merging tables which have columns with the same names.
Return value modes
Crandas has different return value modes. By default, the result of a
computation (e.g. an aggregation such as
.mean() or
.sum()) is directly
disclosed to the user. However, it is also possible to keep the value
secret, and only use it as an intermediate result to compute the final
outcome. This can be done using mode="regular":
max1 = table1["col1"].max(mode="regular")
max2 = table2["col1"].max(mode="regular")
>>> max1
<crandas.crandas.ReturnValue at 0x7fa019c94f90>
Now we compute the product of the maximum elements in table1 and
table2, without opening either of them:
We can filter on the elements having the maximum value, without revealing what the maximum is.
Now that we know how to work with columns, in the next section we will find out how to select specific rows and how to filter based on values found in the data.