Descriptive statistics

When doing data analysis, it is useful to have statistical measures to understand the data. Because the data inside the engine cannot be accessed directly, statistics are especially useful for understanding its shape. Beyond aggregations such as the mean and the variance, which we saw in Working with numeric data, crandas provides more powerful tools.

Correlation table

The simplest statistic is the correlation table. Given a table with numeric columns, we can create a table containing the Pearson correlation coefficient between each pair of columns. For more information on the Pearson coefficient, see here. Calling CDataFrame.corr() on any table with numeric columns creates a new CDataFrame with the correlations between columns. The min_periods parameter sets a threshold: no correlation is computed for a column unless it contains at least min_periods values.

>>> import crandas as cd
>>> import pandas as pd

>>> df = cd.DataFrame(
...     {
...         "col1": [1, 2, 3, 4, 5],
...         "col2": [5, 4, 3, 2, 1],
...         "col3": [2, 4, 7, 4, 1],
...         "col4": [52, 38, 7405, 3, 65],
...     }
... )
>>> corr_df = df.corr()
>>> corr_df.open()
            col1      col2      col3      col4
col1  1.000000 -0.999999 -0.137361 -0.000432
col2 -0.999999  1.000000  0.137361  0.000432
col3 -0.137361  0.137361  1.000000  0.822226
col4 -0.000432  0.000432  0.822226  1.000000
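Because the engine works with fixed-point numbers, opened results such as the -0.999999 above are close approximations of the exact values. As a plaintext sanity check, the same correlations (including the min_periods threshold) can be reproduced with plain pandas, whose API crandas mirrors:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": [5, 4, 3, 2, 1],
    "col3": [2, 4, 7, 4, 1],
    "col4": [52, 38, 7405, 3, 65],
})
corr = df.corr()  # Pearson by default; col1/col2 are perfectly anti-correlated

# min_periods: a pair of columns must share at least this many observations,
# otherwise the correlation for that pair is reported as NaN
sparse = df.assign(col4=[52.0, 38.0, np.nan, np.nan, np.nan])
corr_sparse = sparse.corr(min_periods=4)  # pairs involving col4 become NaN
```
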

Pearson correlation can only be computed over numeric columns. If we attempt to compute it over any other type of column, we get an error.

# We add a column of strings
df = df.assign(col_string=["cannot", "compute", "correlation", "on", "strings"])
# This will give us an error because not all columns are numeric
corr_df = df.corr()
# We can avoid it by adding `numeric_only=True`
df.corr(numeric_only=True)

When we do not want the entire table but just the correlation between two columns, we can use CSeries.corr().

>>> col_corr = df["col1"].corr(df["col3"])
>>> col_corr.open()
-0.1373605728149414

Ranking

Sometimes, we are less interested in the actual values but more in the ranking between them and how they interact. In order to do this, we have the rankdata function. Modelled after the function of the same name from the SciPy package, it allows us to create a ranking of any numerical column. For example, given the array [42, 21, 5, 99], we will receive a ranking [3, 2, 1, 4].

>>> import crandas as cd
>>> import crandas.stats

>>> df = cd.DataFrame({"a": [5, 3, 4, 8, 9, 10, 7, 1]}, auto_bounds=True)
>>> ranked = crandas.stats.rankdata(df["a"])
>>> ranked.open().to_numpy()
array([4., 2., 3., 6., 7., 8., 5., 1.])

The process is less straightforward when there are multiple entries with the same value. In that case, we can use the method option, which determines how tied values are handled. The available options are average, min and max.

>>> df = cd.DataFrame({"a": [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10]}, auto_bounds=True)

# The average of the ranks of the tied values is assigned to each of the values
>>> ranked = crandas.stats.rankdata(df["a"], method="average")
>>> ranked.open().to_numpy()
array([ 1.5,  1.5,  4. ,  4. ,  4. ,  6.5,  6.5,  8.5,  8.5, 10. , 11.5,
   11.5, 13. ])

# The minimum of the ranks of the tied values is assigned to each of the values
>>> ranked = crandas.stats.rankdata(df["a"], method="min")
>>> ranked.open().to_numpy()
array([ 1,  1,  3,  3,  3,  6,  6,  8,  8, 10, 11, 11, 13])

# The maximum of the ranks of the tied values is assigned to each of the values
>>> ranked = crandas.stats.rankdata(df["a"], method="max")
>>> ranked.open().to_numpy()
array([ 2,  2,  5,  5,  5,  7,  7,  9,  9, 10, 12, 12, 13])
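The tie-handling rules above can be sketched in plain Python (a simplified plaintext reference for intuition, not crandas's secure implementation):

```python
def rankdata(values, method="average"):
    """Rank values from 1..n; ties are resolved according to `method`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the run of positions i..j holding the same value
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        if method == "average":
            rank = (i + j) / 2 + 1   # mean of ranks i+1 .. j+1
        elif method == "min":
            rank = i + 1
        elif method == "max":
            rank = j + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks
```

Running it on the examples above, `rankdata([42, 21, 5, 99])` yields `[3, 2, 1, 4]`, and the three methods reproduce the tied rankings shown for column `a`.
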

Contingency tables

Another useful tool when doing data exploration is checking how many times each pair of values appears in a table. For this, we use the crosstab function. This function takes two columns from the same table and counts the occurrences of each pair of values. The levels field determines which values are taken into account: it is required for the second column but optional for the first. If levels are not specified, all of the values are used.

import crandas.stats.contingency

df = cd.DataFrame(
    {
        "a": [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10],
        "b": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    }
)
# The tuple in `levels` includes a value for the first and second column
# Because we want all the values of the first column, we pass `None`
# But for the second column, we need to list all values
cont_tab = crandas.stats.contingency.crosstab(df["a"], df["b"], levels=(None, [0, 1]))

The result is based on SciPy’s CrosstabResult. When opening the result, we need to know what we are looking at. The pair of lists under elements represents the values in the first and second columns; they can be interpreted as the row and column labels, respectively. The table under count gives the number of occurrences of each pair.

>>> cont_tab.open()
CrosstabResult(elements=([1, 3, 4, 5, 8, 9, 10], [0, 1]), count=
    y0  y1
0   1   1
1   0   3
2   2   0
3   1   1
4   0   1
5   2   0
6   0   1
)

In the previous table, the 3 in the second column indicates that there are three instances of the pair (3, 1).
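For intuition, the same counts can be reproduced in plain Python: the row levels are inferred from the first column (the `None` case), while the explicit `[0, 1]` levels are used for the second:

```python
from collections import Counter

a = [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10]
b = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]

pairs = Counter(zip(a, b))            # occurrences of each (a, b) pair
row_levels = sorted(set(a))           # levels inferred for the first column
col_levels = [0, 1]                   # levels given explicitly for the second
count = [[pairs[(r, c)] for c in col_levels] for r in row_levels]
```

The resulting `count` matrix matches the table above, e.g. its second row is `[0, 3]` for the three instances of the pair (3, 1).
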

# If we change the value in `levels`, we will get only a subset of the values
>>> crandas.stats.contingency.crosstab(df["a"],df["b"],levels=([1,2,3,4,5],[0,1])).open()
CrosstabResult(elements=([1, 2, 3, 4, 5], [0, 1]), count=
    y0  y1
0   1   1
1   0   0
2   0   3
3   2   0
4   1   1
)

# We always need to add the `levels` value for the second column
>>> crandas.stats.contingency.crosstab(df["a"], df["b"], levels=([1,2,3,4,5], None)).open()
ValueError: The `levels` argument needs to be set for column 'b'

Contingency tables make it possible to do exploratory data analysis that is already useful while still respecting privacy.

Hypothesis testing

Crandas contains a number of hypothesis testing functions. These can be used for more complex data exploration, to check whether certain statistical properties hold in the data. Implementation details can be found in the API guide; here we list the four tests and what they do:

  • chisquare: Pearson’s Chi-Square test, tests whether it is likely that observed differences in categorical data arose by chance.

  • chi2_contingency: Chi-square test of independence on a contingency table.

  • ttest_ind: T-test, tests whether the difference between the response of two groups is statistically significant or not.

  • kruskal: Kruskal-Wallis test, tests whether the population medians of all of the groups are equal.
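To build intuition for what, e.g., chisquare measures, here is a plaintext sketch of the Pearson chi-square statistic that the test is based on (this computes only the statistic, not the p-value, and is not crandas's secure implementation):

```python
def chisquare_stat(observed, expected=None):
    """Pearson chi-square statistic: sum((O - E)^2 / E).

    If no expected frequencies are given, a uniform distribution
    over the categories is assumed, as in SciPy's chisquare."""
    if expected is None:
        mean = sum(observed) / len(observed)
        expected = [mean] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

stat = chisquare_stat([10, 20, 30, 40])  # uniform expectation of 25 per category
```

A large statistic relative to the chi-square distribution's critical value suggests the observed differences are unlikely to have arisen by chance.
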