.. _desc_stats:

Descriptive statistics
######################

When doing data analysis, it is useful to have statistical measures to understand the data. Because we cannot access the data in the engine, statistics are especially useful for understanding the shape of the data. Beyond aggregations such as the mean and the variance, which we saw in :ref:`numeric`, crandas provides more powerful tools.

Correlation table
=================

The simplest statistic is the correlation table. Given a table with numeric columns, we can create a table that contains the Pearson correlation coefficient between each pair of columns. For more information on the Pearson coefficient, see `here `_.

We can call :meth:`.CDataFrame.corr` on any table with numeric columns to create a new :class:`.CDataFrame` with the correlations between columns. We can use the parameter ``min_periods`` to set a threshold, such that no correlation is calculated if a column has fewer than ``min_periods`` values.

.. code:: python

    >>> import crandas as cd
    >>> import pandas as pd
    >>> df = cd.DataFrame(
    ...     {
    ...         "col1": [1, 2, 3, 4, 5],
    ...         "col2": [5, 4, 3, 2, 1],
    ...         "col3": [2, 4, 7, 4, 1],
    ...         "col4": [52, 38, 7405, 3, 65],
    ...     }
    ... )
    >>> corr_df = df.corr()
    >>> corr_df.open()
              col1      col2      col3      col4
    col1  1.000000 -0.999999 -0.137361 -0.000432
    col2 -0.999999  1.000000  0.137361  0.000432
    col3 -0.137361  0.137361  1.000000  0.822226
    col4 -0.000432  0.000432  0.822226  1.000000

Pearson correlation can only be computed over numeric columns. If we attempt to compute it over any other type of column, we get an error.

.. code:: python

    # We add a column of strings
    df = df.assign(col_string=["cannot", "compute", "correlation", "on", "strings"])

    # This will give us an error because not all columns are numeric
    corr_df = df.corr()

    # We can avoid the error by passing `numeric_only=True`
    df.corr(numeric_only=True)

When we do not want the entire table but just the correlation between two columns, we can use :meth:`.CSeries.corr`.
.. code:: python

    >>> col_corr = df["col1"].corr(df["col3"])
    >>> col_corr.open()
    -0.1373605728149414

Ranking
=======

Sometimes, we are less interested in the actual values than in the ranking between them and how they interact. For this, we have the :func:`rankdata<.stats.ranking.rankdata>` function. Modelled after the `function `_ of the same name from the SciPy package, it allows us to create a ranking of any numerical column. For example, given the array ``[42, 21, 5, 99]``, we will receive the ranking ``[3, 2, 1, 4]``.

.. code:: python

    >>> import crandas as cd
    >>> import crandas.stats
    >>> df = cd.DataFrame({"a": [5, 3, 4, 8, 9, 10, 7, 1]}, auto_bounds=True)
    >>> ranked = crandas.stats.rankdata(df["a"])
    >>> ranked.open().to_numpy()
    array([4., 2., 3., 6., 7., 8., 5., 1.])

The process is not as straightforward when there are multiple entries with the same value. In that case, we can use the ``method`` option, which determines how tied values are ranked. The available options are ``average``, ``min`` and ``max``.

.. code:: python

    >>> df = cd.DataFrame({"a": [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10]}, auto_bounds=True)

    # The average of the ranks of the tied values is assigned to each of them
    >>> ranked = crandas.stats.rankdata(df["a"], method="average")
    >>> ranked.open().to_numpy()
    array([ 1.5,  1.5,  4. ,  4. ,  4. ,  6.5,  6.5,  8.5,  8.5, 10. , 11.5,
           11.5, 13. ])

    # The minimum of the ranks of the tied values is assigned to each of them
    >>> ranked = crandas.stats.rankdata(df["a"], method="min")
    >>> ranked.open().to_numpy()
    array([ 1,  1,  3,  3,  3,  6,  6,  8,  8, 10, 11, 11, 13])

    # The maximum of the ranks of the tied values is assigned to each of them
    >>> ranked = crandas.stats.rankdata(df["a"], method="max")
    >>> ranked.open().to_numpy()
    array([ 2,  2,  5,  5,  5,  7,  7,  9,  9, 10, 12, 12, 13])
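To make the tie-handling rules concrete, here is a plain-Python sketch of how ranks with ties can be computed. This is an illustration only, not the crandas implementation; the helper name ``rank_with_ties`` is hypothetical, but it follows the same tie conventions as SciPy's ``rankdata``.

```python
def rank_with_ties(values, method="average"):
    """Rank values 1-based, resolving ties by `method` (a sketch).

    method: "average", "min" or "max", mirroring the tie-handling
    options described above.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of equal values starting at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # The 1-based ranks this run of tied values occupies
        tied = list(range(i + 1, j + 2))
        if method == "average":
            r = sum(tied) / len(tied)
        elif method == "min":
            r = tied[0]
        else:  # "max"
            r = tied[-1]
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

# Reproduces the rankings shown above
rank_with_ties([42, 21, 5, 99])                  # [3.0, 2.0, 1.0, 4.0]
rank_with_ties([1, 1, 3, 3, 3], method="min")    # [1, 1, 3, 3, 3]
```

This can be handy for checking an opened crandas result against the expected tie behaviour on small public data.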
.. _cont_table:

Contingency tables
==================

Another tool that is useful when doing data exploration is checking how many times each `pair` of values appears in a table. For this, we use the :func:`crosstab<.crandas.stats.contingency.crosstab>` function. This function takes two columns from the same table and counts the occurrences of each pair of values.

The ``levels`` argument determines which values will be taken into account. It is **obligatory** for the second column, but not for the first one. If no levels are specified, all of the values are used.

.. code:: python

    import crandas.stats.contingency

    df = cd.DataFrame(
        {
            "a": [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10],
            "b": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
        }
    )

    # The tuple in `levels` includes a value for the first and second column
    # Because we want all the values of the first one, we put `None`
    # But for the second one, we need to add all values
    cont_tab = crandas.stats.contingency.crosstab(df["a"], df["b"], levels=(None, [0, 1]))

The result of this function is based on SciPy's ``CrosstabResult``. When opening the result, we need to know what we are looking at. The pair of lists under ``elements`` contains the values of the first and second columns; they can be interpreted as the row and column labels, respectively. The table in ``count`` gives the number of appearances of each pair.

.. code:: python

    >>> cont_tab.open()
    CrosstabResult(elements=([1, 3, 4, 5, 8, 9, 10], [0, 1]), count=
       y0  y1
    0   1   1
    1   0   3
    2   2   0
    3   1   1
    4   0   1
    5   2   0
    6   0   1
    )

In the previous table, the ``3`` in the second column represents that there are three instances of the pair ``(3, 1)``.
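The counting that ``crosstab`` performs can be sketched in a few lines of plain Python. This is an illustration of the semantics on public data, not the crandas implementation; the helper name ``crosstab_counts`` is hypothetical.

```python
def crosstab_counts(a, b, row_levels=None, col_levels=None):
    """Count occurrences of each (row, column) pair of values (a sketch).

    If levels are not given for an axis, all distinct values are used,
    mirroring `levels=None` above.
    """
    rows = sorted(set(a)) if row_levels is None else row_levels
    cols = sorted(set(b)) if col_levels is None else col_levels
    return [
        [sum(1 for x, y in zip(a, b) if x == r and y == c) for c in cols]
        for r in rows
    ]

# The same data as in the example above
a = [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10]
b = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
counts = crosstab_counts(a, b, col_levels=[0, 1])
# counts matches the `count` table shown above, row by row:
# [[1, 1], [0, 3], [2, 0], [1, 1], [0, 1], [2, 0], [0, 1]]
```

Restricting ``row_levels`` to a subset drops the rows outside it, which is exactly what passing explicit ``levels`` for the first column does.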
.. code:: python

    # If we change the value in `levels`, we will get only a subset of the values
    >>> crandas.stats.contingency.crosstab(df["a"], df["b"], levels=([1, 2, 3, 4, 5], [0, 1])).open()
    CrosstabResult(elements=([1, 2, 3, 4, 5], [0, 1]), count=
       y0  y1
    0   1   1
    1   0   0
    2   0   3
    3   2   0
    4   1   1
    )

    # We always need to add the `levels` value for the second column
    >>> crandas.stats.contingency.crosstab(df["a"], df["b"], levels=([1, 2, 3, 4, 5], None)).open()
    ValueError: The `levels` argument needs to be set for column 'b'

Contingency tables can be used for exploratory data analysis that is already useful while still respecting privacy.

Hypothesis testing
==================

Crandas contains a number of hypothesis testing functionalities. These can be used for more complex data exploration, to see whether certain statistical properties hold in the data. An actual implementation guide can be found in the :ref:`API guide`, but here we list all four tests and what they do:

- :func:`chisquare<.stats.hypothesis_testing.chisquare>`: Pearson's chi-square test; tests whether it is likely that observed differences in categorical data arose by chance.
- :func:`chi2_contingency<.stats.hypothesis_testing.chi2_contingency>`: chi-square test of independence on a :ref:`contingency table <cont_table>`.
- :func:`ttest_ind<.stats.hypothesis_testing.ttest_ind>`: t-test; tests whether the difference between the responses of two groups is statistically significant.
- :func:`kruskal<.stats.hypothesis_testing.kruskal>`: Kruskal-Wallis test; tests whether the population medians of all of the groups are equal.
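As a small illustration of what the first of these tests measures, the chi-square statistic is the sum of ``(observed - expected)**2 / expected`` over all categories. The sketch below computes that statistic in plain Python; the helper name ``chisquare_stat`` is hypothetical, and by default it assumes equal expected frequencies, as SciPy's ``chisquare`` does.

```python
def chisquare_stat(observed, expected=None):
    """Pearson's chi-square statistic: sum((obs - exp)^2 / exp) (a sketch).

    If `expected` is omitted, all categories are assumed equally likely.
    """
    if expected is None:
        mean = sum(observed) / len(observed)
        expected = [mean] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed category counts compared against a uniform expectation;
# a small statistic suggests the deviations could easily be chance.
stat = chisquare_stat([16, 18, 16, 14, 12, 12])  # 2.0
```

The crandas functions additionally return a p-value and, crucially, evaluate the statistic inside the engine so the underlying counts themselves are never revealed.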