Selecting data
Given a table, we often want to select only specific rows, either because they hold a certain value or because we want to access a few random data points. In this section, we discuss the methods for selecting data in crandas: filtering, slicing, shuffling, and sampling.
Filtering
Let’s say we want to work with the rows of a table that fulfill a specific property.
In previous sections, we saw how to use comparisons to produce a column that marks the rows where those properties are fulfilled.
Now, we can use that to select only the rows that have the properties that we want.
The CDataFrame.filter() method is used to filter rows of a CDataFrame based on a given condition.
import crandas as cd
df = cd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
# We can use bracket notation to filter a table
filtered_df = df[df["A"] != 2]
# Functional notation is also available
filtered_df2 = df.filter(df["B"] > 5)
In the example above, we filter df using the condition that column A is not equal to two. The comparison returns a boolean series, which is then used to filter the rows of the CDataFrame df. The resulting CDataFrame, filtered_df, contains only the rows where the value of column A is not two.
More complex filters are possible by combining conditions with the logical operators & (and), | (or) and ^ (xor):
# Filter such that remaining rows meet the conditions that `A > 1` OR `B < 6`
df = df[(df["A"] > 1) | (df["B"] < 6)]
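The parentheses around each comparison matter. In Python, &, | and ^ bind more tightly than comparison operators such as > and <, so leaving them out changes how the expression is parsed. The sketch below uses plain integers instead of crandas columns, purely to illustrate the parsing; it is not crandas code.

```python
# Plain-Python illustration of operator precedence (no crandas involved).
a, b = 2, 4

# With parentheses: each comparison is evaluated first, then combined.
with_parens = (a > 1) | (b < 6)

# Without parentheses: parsed as the chained comparison a > (1 | b) < 6.
without_parens = a > 1 | b < 6

print(with_parens)     # True
print(without_parens)  # False, because 1 | 4 == 5 and 2 > 5 is False
```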
Filtering also supports a threshold.
When a threshold is given, the filter only returns a result if at least that many rows would be selected.
Several operations in crandas can require a threshold; for more information, see privacy.
To set one, pass the keyword argument threshold=t to CDataFrame.filter().
The same can be done in bracket notation by calling with_threshold(t) on the condition, although the functional notation is cleaner:
# This query will work as expected
filtered_df = df[(df["A"] != 2).with_threshold(2)]
# This query will not return a table, as only one row fulfills the condition
filtered_df2 = df.filter(df["B"] > 5, threshold=2)
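To make the threshold rule concrete, here is a minimal model of it on plain Python rows. It only mirrors the behavior described above and is not how crandas implements it. The names ThresholdNotMet and filter_with_threshold are hypothetical, and signaling a failed threshold with an exception is an assumption of this sketch.

```python
# Illustrative model of a thresholded filter on plain Python rows.
# `ThresholdNotMet` and `filter_with_threshold` are hypothetical names.


class ThresholdNotMet(Exception):
    """Raised when fewer rows match than the threshold requires."""


def filter_with_threshold(rows, predicate, threshold):
    """Return the matching rows only if at least `threshold` rows match."""
    selected = [row for row in rows if predicate(row)]
    if len(selected) < threshold:
        raise ThresholdNotMet(f"filter matched fewer than {threshold} rows")
    return selected


rows = [{"A": 1, "B": 4}, {"A": 2, "B": 5}, {"A": 3, "B": 6}]

# Two rows have A != 2, so the threshold of 2 is met.
kept = filter_with_threshold(rows, lambda r: r["A"] != 2, threshold=2)

# Only one row has B > 5, so the same threshold is not met.
try:
    filter_with_threshold(rows, lambda r: r["B"] > 5, threshold=2)
    threshold_met = True
except ThresholdNotMet:
    threshold_met = False
```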
Slicing
Sometimes you will want to select rows based on their indices rather than on the data they contain.
To do this, you can use the CDataFrame.slice() method, which takes a Python slice.
The slice object determines the starting index, the stopping index as well as the size of the step.
The method then returns a new CDataFrame with the specified rows.
df = cd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [4, 5, 6, 7, 8]})
# This table contains the first four rows
sliced_df = df.slice(slice(4))
# This table contains the second and third rows (indices 1 and 2)
sliced_df = df.slice(slice(1, 3))
# This table contains the first, third and last rows
sliced_df = df.slice(slice(0, 5, 2))
You may be more familiar with Python's extended indexing notation.
It can also be used to slice a CDataFrame:
# This table contains the first four rows
sliced_df = df[:4]
# This table contains the second and third rows (indices 1 and 2)
sliced_df = df[1:3]
# This table contains the first, third and last rows
sliced_df = df[::2]
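The two notations are equivalent because Python turns a bracket expression such as [1:3] into a slice object before handing it to the container. The sketch below shows this with plain Python objects; ShowKey is a hypothetical helper written for this illustration and is not part of crandas.

```python
# Plain-Python illustration: bracket notation passes a slice object.
class ShowKey:
    """Hypothetical helper that returns whatever key it is indexed with."""

    def __getitem__(self, key):
        return key


show = ShowKey()
rows = [1, 2, 3, 4, 5]

# Each bracket expression produces the same slice object as the explicit call.
assert show[:4] == slice(4)
assert show[1:3] == slice(1, 3)
assert show[::2] == slice(None, None, 2)

# slice.indices(length) resolves the concrete start, stop and step.
start, stop, step = slice(None, None, 2).indices(len(rows))
selected = rows[::2]
```

For a five-row table, [::2] resolves to a start of 0, a stop of 5 and a step of 2, which is why it selects the same rows as slice(0, 5, 2).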
Shuffling and sampling
If we want to have access to random rows of a table, we have two options.
We can shuffle the table using CDataFrame.shuffle() and then use the slicing mechanism we just learned.
Alternatively, we can simply use the CDataFrame.sample() method to get a specific number of rows, or even a fraction of the total dataset.
import crandas as cd
# This will shuffle the table
shuffled_df = df.shuffle()
# This will generate a table that is half the size of the original, with random rows
sampled_df1 = df.sample(frac=0.5)
# This will output two random rows of the table
sampled_df2 = df.sample(n=2)
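The two options can be compared on a plain Python list using the standard random module. This is only a model of the idea, not a description of how crandas shuffles or samples. The fixed seed and the rounding down of the fractional row count are assumptions made for this sketch.

```python
import random

rows = [(1, 4), (2, 5), (3, 6), (4, 7), (5, 8)]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Option 1: shuffle a copy, then slice off the first two rows.
shuffled = rows[:]
rng.shuffle(shuffled)
first_two = shuffled[:2]

# Option 2: draw two distinct rows directly.
sampled_n = rng.sample(rows, k=2)

# A fraction is converted to a row count; rounding down is an assumption here.
frac = 0.5
sampled_frac = rng.sample(rows, k=int(len(rows) * frac))
```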
In this section we have learned to select rows of a table randomly, by their position in the table, or by their values. In the next section we will look more closely at how data is stored in the engine and what we can do with it.