.. _selecting:

Selecting data
###########################

Given a table, we often want to select only specific rows, either because they have a certain value or because we just want to access some random data points. In this section, we discuss various methods for selecting data in crandas, including filtering, slicing, shuffling, and sampling.

Filtering
=========

Let's say we want to work with the rows of a table that fulfill a specific property. In previous sections, we saw how to do comparisons to get a column marking the rows where those properties are fulfilled. Now, we can use that column to select only the rows that have the properties we want. The :meth:`.CDataFrame.filter` method is used to filter rows of a :class:`.CDataFrame` based on a given condition.

.. code:: python

    import crandas as cd

    df = cd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

    # We can use bracket notation to filter a table
    filtered_df = df[df["A"] != 2]

    # Functional notation is also available
    filtered_df2 = df.filter(df["B"] > 5)

In the example above, we filter ``df`` using the condition that column ``A`` is not equal to two, which returns a boolean series that is subsequently used to filter the rows of the :class:`.CDataFrame` ``df``. The resulting :class:`.CDataFrame`, ``filtered_df``, contains only the rows where the value of column ``A`` is not two.

More complex filters are possible, using the logical operators `and` ``&``, `or` ``|`` and `xor` ``^``:

.. code:: python

    # Filter such that remaining rows meet the conditions that `A > 1` OR `B < 6`
    df = df[(df["A"] > 1) | (df["B"] < 6)]

The filtering operation allows for operations *with threshold*. If it is called this way, it will only return an answer if a minimum number of rows would be selected after applying the filter. Multiple operations in crandas can require a threshold; for more information see :ref:`privacy`. To do this, pass the argument ``threshold=t`` to the :meth:`.CDataFrame.filter` call.
It is also possible to do this with bracket notation, but it is cleaner in the functional notation:

.. code:: python

    # This query will work as expected
    filtered_df = df[(df["A"] != 2).with_threshold(2)]

    # This query will not return a table, as only one row fulfills the condition
    filtered_df2 = df.filter(df["B"] > 5, threshold=2)

Slicing
========

Sometimes, you will want to select rows of a table based not on the data in them but on their indices. To do this, you can use the :meth:`.CDataFrame.slice` method, which takes a Python ``slice`` object. The slice object determines the `starting` index, the `stopping` index, and the size of the `step`. The method then returns a new :class:`.CDataFrame` with the specified rows.

.. code:: python

    df = cd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [4, 5, 6, 7, 8]})

    # This table contains the first four rows
    sliced_df = df.slice(slice(4))

    # This table contains the second and third rows (indices 1 and 2)
    sliced_df = df.slice(slice(1, 3))

    # This table contains the first, third and last rows
    sliced_df = df.slice(slice(0, 5, 2))

You are probably more familiar with Python's extended indexing notation. You can also use it to slice a :class:`.CDataFrame`:

.. code:: python

    # This table contains the first four rows
    sliced_df = df[:4:]

    # This table contains the second and third rows (indices 1 and 2)
    sliced_df = df[1:3]

    # This table contains the first, third and last rows
    sliced_df = df[::2]

Shuffling and sampling
=======================

If we want to have access to random rows of a table, we have two options. We can shuffle the table using :meth:`.CDataFrame.shuffle` and then use the slicing mechanism we just learned. Alternatively, we can simply use the :meth:`.CDataFrame.sample` method to get a specific number of rows, or even a fraction of the total dataset.

.. code:: python

    import crandas as cd

    # This will shuffle the table
    shuffled_df = df.shuffle()

    # This will generate a table that is half the size of the original, with random rows
    sampled_df1 = df.sample(frac=0.5)

    # This will output two random rows of the table
    sampled_df2 = df.sample(n=2)

In this section we have learned to select rows of a table, be it randomly, based on their position in the table, or based on their values. In the next section we will go deeper into how data is stored in the VDL and the different things we can do with it.
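
As a quick sanity check on the slicing examples in this section, the following plain-Python sketch lists which row indices a given ``slice`` selects from a five-row table. It does not use crandas at all, and the helper name ``selected_rows`` is hypothetical, introduced only for illustration; it relies solely on the built-in ``slice.indices`` method.

.. code:: python

    def selected_rows(s, n_rows):
        """Return the row indices that slice `s` picks from a table with `n_rows` rows."""
        # slice.indices() resolves missing or negative bounds against the given length
        start, stop, step = s.indices(n_rows)
        return list(range(start, stop, step))

    n_rows = 5  # same number of rows as the table in the slicing examples

    print(selected_rows(slice(4), n_rows))        # first four rows
    print(selected_rows(slice(1, 3), n_rows))     # second and third rows
    print(selected_rows(slice(0, 5, 2), n_rows))  # first, third and last rows

Checking indices locally like this can help when working out a slice before applying it to a table, since the indices refer to row positions and not to the values stored in the rows.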