.. _selecting:

Selecting data
###########################

Given a table, we often want to select only specific rows, either because they have a certain value or because we just want to access some random data points.
In this section, we discuss various methods for selecting data in crandas, including filtering, slicing, shuffling, and sampling.

Filtering 
=========

Let's say we want to work with the rows of a table that fulfill a specific property. 
In previous sections, we saw how to perform comparisons that produce a column marking the rows where a property is fulfilled.
Now, we can use that to select only the rows that have the properties that we want.
The :meth:`.CDataFrame.filter` method is used to filter rows of a :class:`.CDataFrame` based on a given condition. 

.. code:: python

    import crandas as cd
    df = cd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

    # We can use bracket notation to filter a table 
    filtered_df = df[df["A"] != 2] 

    # Functional notation is also available
    filtered_df2 = df.filter(df["B"] > 5)

In the example above, we filter ``df`` using the condition that column ``A`` is not equal to two. The comparison returns a boolean series that is subsequently used to filter the rows of the :class:`.CDataFrame` ``df``. The resulting :class:`.CDataFrame`, ``filtered_df``, contains only the rows where the value of column ``A`` is not two.

More complex filters are possible, using the logical operators *and* (``&``), *or* (``|``) and *xor* (``^``):

.. code:: python

    # Filter such that remaining rows meet the conditions that `A > 1` OR `B < 6`
    df = df[(df["A"] > 1) | (df["B"] < 6)]
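
To see which rows a combined condition keeps, it can help to evaluate it by hand.
The following is a plain-Python sketch, not crandas code: it applies the same operators to ordinary Python booleans, row by row, for the example table above.
The ``keep`` helper is a hypothetical name introduced only for this illustration.

.. code:: python

    # Plain-Python illustration (no crandas involved): the example table as lists
    A = [1, 2, 3]
    B = [4, 5, 6]

    def keep(mask, values):
        """Return the values whose mask entry is True."""
        return [v for v, m in zip(values, mask) if m]

    # Element-wise masks for the three logical operators
    or_mask = [(a > 1) | (b < 6) for a, b in zip(A, B)]
    and_mask = [(a > 1) & (b < 6) for a, b in zip(A, B)]
    xor_mask = [(a > 1) ^ (b < 6) for a, b in zip(A, B)]

    print(or_mask)            # [True, True, True]
    print(and_mask)           # [False, True, False]
    print(xor_mask)           # [True, False, True]
    print(keep(and_mask, A))  # [2]

For this particular table every row satisfies at least one of the two conditions, so the *or* filter keeps all three rows.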

The filtering operation can also be applied *with a threshold*.
When called this way, it only returns a result if at least a minimum number of rows would be selected by the filter.
Several operations in crandas can require a threshold; for more information see :ref:`privacy`.
To set a threshold, pass the keyword argument ``threshold=t`` to :meth:`.CDataFrame.filter`.
It is also possible to do this with bracket notation, but the functional notation is cleaner:

.. code:: python

    # This query will work as expected
    filtered_df = df[(df["A"] != 2).with_threshold(2)]

    # This query will not return a table, as only one row fulfills the condition
    filtered_df2 = df.filter(df["B"] > 5, threshold = 2)
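
The threshold rule can be pictured with a small plain-Python model.
This is only a conceptual sketch of the behaviour described above, not how crandas implements or reports it: the function name ``filter_with_threshold`` and the use of ``ValueError`` are assumptions made for illustration.

.. code:: python

    def filter_with_threshold(values, mask, threshold):
        """Keep the values where mask is True, but only if at least
        `threshold` values would be selected (hypothetical helper)."""
        selected = [v for v, m in zip(values, mask) if m]
        if len(selected) < threshold:
            # Stand-in for crandas declining to return a table
            raise ValueError("fewer rows selected than the threshold allows")
        return selected

    A = [1, 2, 3]
    B = [4, 5, 6]

    # Two rows satisfy A != 2, so a threshold of 2 is met
    print(filter_with_threshold(A, [a != 2 for a in A], threshold=2))  # [1, 3]

    # Only one row satisfies B > 5, so a threshold of 2 is not met
    try:
        filter_with_threshold(B, [b > 5 for b in B], threshold=2)
    except ValueError as err:
        print("refused:", err)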


Slicing
========

Sometimes you will want to select rows of a table based not on the data they contain, but on their indices.
To do this, you can use the :meth:`.CDataFrame.slice` method, which takes a Python `slice <https://docs.python.org/3/library/functions.html#slice>`_.
The slice object determines the *start* index, the *stop* index, and the size of the *step*.

The method then returns a new :class:`.CDataFrame` with the specified rows.

.. code:: python

    df = cd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [4, 5, 6, 7, 8]})

    # This table contains the first four rows
    sliced_df = df.slice(slice(4))

    # This table contains the second and third rows (indices 1 and 2)
    sliced_df = df.slice(slice(1, 3))

    # This table contains the first, third and last rows
    sliced_df = df.slice(slice(0, 5, 2))


You are probably more familiar with Python's extended indexing notation.
You can also use it to slice a :class:`.CDataFrame`:

.. code:: python

    # This table contains the first four rows
    sliced_df = df[:4:]

    # This table contains the second and third rows (indices 1 and 2)
    sliced_df = df[1:3]

    # This table contains the first, third and last rows
    sliced_df = df[::2]
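
If it is unclear which rows a given slice refers to, standard Python can resolve it without touching any table.
The sketch below uses the built-in ``slice.indices`` method to list the row indices that each slice from the examples above selects in a five-row table.
The helper ``selected_indices`` is a hypothetical name, and the sketch assumes crandas follows Python's usual slice semantics, as the comments in the examples suggest.

.. code:: python

    def selected_indices(s, n_rows):
        """List the row indices a slice selects in a table of n_rows rows."""
        start, stop, step = s.indices(n_rows)
        return list(range(start, stop, step))

    n_rows = 5

    print(selected_indices(slice(4), n_rows))              # [0, 1, 2, 3]
    print(selected_indices(slice(1, 3), n_rows))           # [1, 2]
    print(selected_indices(slice(0, 5, 2), n_rows))        # [0, 2, 4]
    # The bracket form [::2] corresponds to slice(None, None, 2)
    print(selected_indices(slice(None, None, 2), n_rows))  # [0, 2, 4]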


Shuffling and sampling
=======================

If we want to have access to random rows of a table, we have two options. 
We can shuffle the table using :meth:`.CDataFrame.shuffle` and then use the slicing mechanism we just learned.
Alternatively, we can simply use the :meth:`.CDataFrame.sample` method to get a specific number of rows, or even a fraction of the total dataset.

.. code:: python

    import crandas as cd

    # This will shuffle the table
    shuffled_df = df.shuffle()

    # This will generate a table that is half the size of the original, with random rows
    sampled_df1 = df.sample(frac=0.5)

    # This will output two random rows of the table
    sampled_df2 = df.sample(n=2)
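
Python's standard library offers a rough analogy for the two options.
The sketch below works on an ordinary list rather than a :class:`.CDataFrame`, and it assumes rows are drawn without replacement, which the text above does not specify for crandas.

.. code:: python

    import random

    rows = [1, 2, 3, 4, 5]  # stand-in for the rows of a table

    # Option 1: shuffle a copy, then slice off the first two rows
    shuffled = rows.copy()
    random.shuffle(shuffled)
    first_two = shuffled[:2]

    # Option 2: draw two rows directly
    drawn = random.sample(rows, 2)

    # Either way we end up with two distinct rows of the original
    print(first_two, drawn)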

In this section we have learned how to select rows of a table, whether randomly, based on their position in the table, or based on their values.
In the next section we will look more closely at how data is stored in the engine and what we can do with it.