Selecting data¶

Given a table, we often want to only select specific rows, be it because they have a certain value or if we just want to access some random data points. In this section, we discuss various methods for selecting data in crandas, including filtering, slicing, shuffling, and sampling.

Filtering¶

Let’s say we want to work with the rows of a table that fulfill a specific property. In previous sections, we found how to do comparisons to get a column marking whenever those properties are fulfilled. Now, we can use that to select only the rows that have the properties that we want. The DataFrame.filter() method is used to filter rows of a DataFrame based on a given condition.

import crandas as cd
df = cd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# We can use bracket notation to filter a table
filtered_df = df[df["A"] != 2]

# Functional notation is also available
filtered_df2 = df.filter(df["B"] > 5)

In the example above, we filter df using the condition that column A is not equal to two, which returns a boolean series that is subsequently used to filter the rows of the DataFrame df. The resulting DataFrame, filtered_df, contains only the rows where the value of column A not two.

More complex filters are possible, using the logical operators and &, or | and xor ^:

# Filter such that remaining rows meet the conditions that `A > 1` OR `B < 6`
df = df[(df["A"] > 1) | (df["B"] < 6)]

The filtering operation allows for operations with threshold. If it is called this way, it will only return an answer if a minimum number of rows would be selected after applying the filter. Multiple operations in crandas can require a threshold, for more information see Appendix 2: Thresholds. In order to do this, add the key threshold = t to the value. It is also possible to do this with bracket notation, but it is cleaner in the functional notation:

# This query will work as expected
filtered_df = df[(df["A"] != 2).with_threshold(2)]

# This query will not return a table, as only one row fulfills the condition
filtered_df2 = df.filter(df["B"] > 5, threshold = 2)

Sometimes we want to use complex filters that are based in more than one column. In that case, building a filter can become complex, especially as we will not be able to use python operators like and or or. For easier readability and coding, we can save filters individually and combine them:

# Filter such that remaining rows meet the conditions that `A > 1` OR `B < 6`
filter_a_bigger_one = df["A"] > 1
filter_b_under_six = df["B"] < 6
filter_both = filter_a_bigger_one | filter_b_under_six

# We can now use the filter for the table, even adding a threshold
filtered_df3 = df[filter_both.with_threshold(2)]

Slicing¶

Sometimes, you will want to select certain rows of a table based not in the data in it, but selecting rows based in their indices. To do this, you can use the DataFrame.slice() method, which takes a python slice. The slice object determines the starting index, the stopping index as well as the size of the step.

The method then returns a new DataFrame with the specified rows.

df = cd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [4, 5, 6, 7, 8]})

# This table contains the first four rows
sliced_df = df.slice(slice(4))

# This table contains the second and third rows (indices 1 and 2)
sliced_df = df.slice(slice(1, 3))

# This table contains the first, third and last rows
sliced_df = df.slice(slice(0, 5, 2))

It is also possible to use DataFrame.slice() to get a fraction of the rows of a table. In order to do that, you must set allow_fractions=True

# This table contains the first four rows
sliced_df = df.slice(slice(0.8), allow_fractions=True)

# This table contains the second and third rows (indices 1 and 2)
sliced_df = df.slice(slice(0.1,0.6),allow_fractions=True)

# This table contains the first, third and last rows
sliced_df = df.slice(slice(0,1.0,2),allow_fractions=True)

Probably, you are more familiar with python’s extended indexing notation. You can also use it to slice a DataFrame:

# This table contains the first four rows
sliced_df = df[:4:]

# This table contains the second and third rows (indices 1 and 2)
sliced_df = df[1:3]

# This table contains the first, third and last rows
sliced_df = df[::2]

Shuffling and sampling¶

If we want to have access to random rows of a table, we have two options. We can shuffle the table using DataFrame.shuffle() and then used the slicing mechanism we just learned. Alternatively, we can simply use the DataFrame.sample() method to get a specific number of rows, or even a fraction of the total dataset.

import crandas as cd

# This will shuffle the table
shuffled_df = df.shuffle()

# This will generate a table that is half the size of the original, with random rows
sampled_df1 = df.sample(frac=0.5)

# This will output two random rows of the table
sampled_df2 = df.sample(n=2)

In this section we have learned to select rows of a table, be it randomly, based on their position on the table or based on their values. In the next section we will go deeper on how data is stored in the engine and the different things we can do with it.

Deduplication¶

To remove duplicates from a table or series, we can use the function DataFrame.drop_duplicates() or CSeries.drop_duplicates(). We can identify duplicated rows by looking for whole rows of a table to be identical, that is, two rows are said to be duplicates if all values are identical. Alternatively, we can also identify duplicated rows by comparing only a fixed subset of values in each row by providing the function a list of column names as the first argument.

The first, the last, any occurrence or none of the identified duplicated rows can be kept in the result, using the parameter keep. We must set keep = 'first' (default) to keep the first occurrence, or set keep = 'last' to keep the last occurrence. When we wish to remove all duplicated occurrences we must set keep = False. When ordering does not matter and any occurrence is fine, we set keep = 'any' which is faster than its alternatives.

Note

The contents of dataframes that are the result of secure computation are typically ordered randomly. To meaningfully keep the first or last occurrence of a set of duplicates the dataframe typically needs to be sorted before removing duplicates.

import crandas as cd
df = cd.DataFrame({"a": [0, 1, 1, 2], "b": ["a", "b", "b", "c"], "c": [True, False, False, True]}, auto_bounds=True)

# This drops all duplicated rows except for one from the table
duplicated_rows_dropped = df.drop_duplicates()

# This drops all duplicated rows, except for the first occurence, where columns 'a' and 'b' are identical
duplicated_rows_dropped_ab = df.drop_duplicates(['a', 'b'])

# This drops all duplicated rows, except for the last occurence, where columns 'a' and 'b' are identical
duplicated_rows_dropped_ab_last = df.drop_duplicates(['a', 'b'], keep='last')

# This drops all duplicated rows, where column 'c' is identical
duplicated_rows_dropped_ab_none = df.drop_duplicates(['c'], keep=False)

# This drops all duplicated values from column 'a'
duplicated_series_a_dropped = df['a'].drop_duplicates(keep=False)