.. _selecting: .. meta:: :description: Filter, slice, and sample encrypted tables in crandas. Achieve privacy-preserving data selection with threshold controls for global compliance. :keywords: crandas, Roseman Labs, filter, slice, select, DataFrame, User Guide Selecting data ########################### Given a table, we often want to only select specific rows, be it because they have a certain value or if we just want to access some random data points. In this section, we discuss various methods for selecting data in crandas, including filtering, slicing, shuffling, and sampling. Filtering ========= Let's say we want to work with the rows of a table that fulfill a specific property. In previous sections, we found how to do comparisons to get a column marking whenever those properties are fulfilled. Now, we can use that to select only the rows that have the properties that we want. The :meth:`.DataFrame.filter` method is used to filter rows of a :class:`.DataFrame` based on a given condition. .. code:: python import crandas as cd df = cd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) # We can use bracket notation to filter a table filtered_df = df[df["A"] != 2] # Functional notation is also available filtered_df2 = df.filter(df["B"] > 5) In the example above, we filter df using the condition that column ``A`` is not equal to two, which returns a boolean series that is subsequently used to filter the rows of the :class:`.DataFrame` ``df``. The resulting :class:`.DataFrame`, ``filtered_df``, contains only the rows where the value of column ``A`` not two. More complex filters are possible, using the logical operators `and` ``&``, `or` ``|`` and `xor` ``^``: .. code:: python # Filter such that remaining rows meet the conditions that `A > 1` OR `B < 6` df = df[(df["A"] > 1) | (df["B"] < 6)] The filtering operation allows for operations *with threshold*. If it is called this way, it will only return an answer if a minimum number of rows would be selected after applying the filter. Multiple operations in crandas can require a threshold, for more information see :ref:`threshold-label`. In order to do this, add the key ``threshold = t`` to the value. It is also possible to do this with bracket notation, but it is cleaner in the functional notation: .. code:: python # This query will work as expected filtered_df = df[(df["A"] != 2).with_threshold(2)] # This query will not return a table, as only one row fulfills the condition filtered_df2 = df.filter(df["B"] > 5, threshold = 2) Sometimes we want to use complex filters that are based in more than one column. In that case, building a filter can become complex, especially as we will not be able to use python operators like ``and`` or ``or``. For easier readability and coding, we can save filters individually and combine them: .. code:: python # Filter such that remaining rows meet the conditions that `A > 1` OR `B < 6` filter_a_bigger_one = df["A"] > 1 filter_b_under_six = df["B"] < 6 filter_both = filter_a_bigger_one | filter_b_under_six # We can now use the filter for the table, even adding a threshold filtered_df3 = df[filter_both.with_threshold(2)] Slicing ======== Sometimes, you will want to select certain rows of a table based not in the data in it, but selecting rows based in their indices. To do this, you can use the :meth:`.DataFrame.slice` method, which takes a python `slice `_. The slice object determines the `starting` index, the `stopping` index as well as the size of the `step`. The method then returns a new :class:`.DataFrame` with the specified rows. .. code:: python df = cd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [4, 5, 6, 7, 8]}) # This table contains the first four rows sliced_df = df.slice(slice(4)) # This table contains the second and third rows (indices 1 and 2) sliced_df = df.slice(slice(1, 3)) # This table contains the first, third and last rows sliced_df = df.slice(slice(0, 5, 2)) It is also possible to use :meth:`.DataFrame.slice` to get a fraction of the rows of a table. In order to do that, you must set ``allow_fractions=True`` .. code:: python # This table contains the first four rows sliced_df = df.slice(slice(0.8), allow_fractions=True) # This table contains the second and third rows (indices 1 and 2) sliced_df = df.slice(slice(0.1,0.6),allow_fractions=True) # This table contains the first, third and last rows sliced_df = df.slice(slice(0,1.0,2),allow_fractions=True) Probably, you are more familiar with python's extended indexing notation. You can also use it to slice a :class:`.DataFrame`: .. code:: python # This table contains the first four rows sliced_df = df[:4:] # This table contains the second and third rows (indices 1 and 2) sliced_df = df[1:3] # This table contains the first, third and last rows sliced_df = df[::2] Shuffling and sampling ======================= If we want to have access to random rows of a table, we have two options. We can shuffle the table using :meth:`.DataFrame.shuffle` and then used the slicing mechanism we just learned. Alternatively, we can simply use the :meth:`.DataFrame.sample` method to get a specific number of rows, or even a fraction of the total dataset. .. code:: python import crandas as cd # This will shuffle the table shuffled_df = df.shuffle() # This will generate a table that is half the size of the original, with random rows sampled_df1 = df.sample(frac=0.5) # This will output two random rows of the table sampled_df2 = df.sample(n=2) In this section we have learned to select rows of a table, be it randomly, based on their position on the table or based on their values. In the next section we will go deeper on how data is stored in the engine and the different things we can do with it. Deduplication ============== To remove duplicates from a table or series, we can use the function :meth:`.DataFrame.drop_duplicates` or :meth:`.CSeries.drop_duplicates`. We can identify duplicated rows by looking for whole rows of a table to be identical, that is, two rows are said to be duplicates if all values are identical. Alternatively, we can also identify duplicated rows by comparing only a fixed subset of values in each row by providing the function a list of column names as the first argument. The first, the last, any occurrence or none of the identified duplicated rows can be kept in the result, using the parameter ``keep``. We must set ``keep = 'first'`` (default) to keep the first occurrence, or set ``keep = 'last'`` to keep the last occurrence. When we wish to remove all duplicated occurrences we must set ``keep = False``. When ordering does not matter and any occurrence is fine, we set ``keep = 'any'`` which is faster than its alternatives. .. note:: The contents of dataframes that are the result of secure computation are typically ordered randomly. To meaningfully keep the first or last occurrence of a set of duplicates the dataframe typically needs to be sorted before removing duplicates. .. code:: python import crandas as cd df = cd.DataFrame({"a": [0, 1, 1, 2], "b": ["a", "b", "b", "c"], "c": [True, False, False, True]}, auto_bounds=True) # This drops all duplicated rows except for one from the table duplicated_rows_dropped = df.drop_duplicates() # This drops all duplicated rows, except for the first occurence, where columns 'a' and 'b' are identical duplicated_rows_dropped_ab = df.drop_duplicates(['a', 'b']) # This drops all duplicated rows, except for the last occurence, where columns 'a' and 'b' are identical duplicated_rows_dropped_ab_last = df.drop_duplicates(['a', 'b'], keep='last') # This drops all duplicated rows, where column 'c' is identical duplicated_rows_dropped_ab_none = df.drop_duplicates(['c'], keep=False) # This drops all duplicated values from column 'a' duplicated_series_a_dropped = df['a'].drop_duplicates(keep=False)