Descriptive statistics
When doing data analysis, statistical measures help us understand the data. Because we cannot look at the data inside the engine directly, statistics are especially useful for understanding its shape. Beyond aggregations such as the mean and the variance, which we saw in Working with numeric data, crandas provides more powerful tools.
Correlation table
The simplest statistic is the correlation table. Given a table with
numeric columns, we can create a table that contains the Pearson
correlation coefficient between each pair of columns. For more information on the
Pearson coefficient, see
here.
We can call .corr() on any
table with numeric columns to create a new
DataFrame with the correlations
between columns. We can use the parameter min_periods to set a
threshold, such that no correlation is calculated if there are fewer than
min_periods values in the column.
import crandas as cd
import pandas as pd
>>> df = cd.DataFrame(
{
"col1": [1, 2, 3, 4, 5],
"col2": [5, 4, 3, 2, 1],
"col3": [2, 4, 7, 4, 1],
"col4": [52, 38, 7405, 3, 65],
}
)
corr_df = df.corr()
>>> corr_df.open()
col1 col2 col3 col4
col1 1.000000 -0.999999 -0.137361 -0.000432
col2 -0.999999 1.000000 0.137361 0.000432
col3 -0.137361 0.137361 1.000000 0.822226
col4 -0.000432 0.000432 0.822226 1.000000
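To make the contents of each cell concrete, here is a minimal plain-Python sketch of the Pearson coefficient, computed on public lists rather than inside the engine. It is not crandas code: the helper name `pearson` is hypothetical, and the way it treats `min_periods` (returning NaN when too few value pairs are present) is an assumption borrowed from pandas, not a statement about how crandas implements it.

```python
import math


def pearson(xs, ys, min_periods=1):
    """Pearson correlation of two equal-length lists; None marks a missing value."""
    # Keep only the positions where both values are present.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    # Assumed threshold rule: too few usable pairs means no coefficient.
    if len(pairs) < max(min_periods, 2):
        return math.nan
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    # A constant column has no defined correlation.
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


columns = {
    "col1": [1, 2, 3, 4, 5],
    "col2": [5, 4, 3, 2, 1],
    "col3": [2, 4, 7, 4, 1],
}
# Build the full table: one coefficient per pair of columns.
corr_table = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
print(round(corr_table["col1"]["col3"], 6))  # -0.137361
```

The value for col1 and col3 agrees with the corresponding entry in the table above. The engine reports -0.999999 where this sketch gives -1 for col1 and col2.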
Pearson correlation can only be computed over numeric columns. If we attempt to do it over any other type of column, we get an error.
# We add a column of strings
df["col_string"] = ["cannot", "compute", "correlation", "on", "strings"]
# This will give us an error because not all columns are numeric
corr_df = df.corr()
# We can avoid it by adding `numeric_only=True`
df.corr(numeric_only=True)
When we do not want the entire table but only the correlation between two
columns, we can use .corr().
Ranking
Sometimes we are less interested in the actual values than in how they
rank relative to each other. For this, we have the rankdata() function.
Modelled after the function of the same name from the SciPy package, it
allows us to create a ranking of any numerical column. For example, given
the array [42, 21, 5, 99], we will receive the ranking [3, 2, 1, 4].
import crandas.stats
df = cd.DataFrame({"a": [5, 3, 4, 8, 9, 10, 7, 1]}, auto_bounds=True)
ranked = crandas.stats.rankdata(df["a"])
>>> ranked.open().to_numpy()
array([4., 2., 3., 6., 7., 8., 5., 1.])
The process is not as straightforward when there are multiple entries
with the same value. In that case, we can use the method option, which
determines what will be done with the tied values. The available options
are average, min and max.
df = cd.DataFrame({"a": [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10]}, auto_bounds=True)
# The average of the ranks of the tied values is assigned to each of the values
ranked = crandas.stats.rankdata(df["a"], method="average")
>>> ranked.open().to_numpy()
array([ 1.5, 1.5, 4. , 4. , 4. , 6.5, 6.5, 8.5, 8.5, 10. , 11.5,
11.5, 13. ])
# The minimum of the ranks of the tied values is assigned to each of the values
>>> ranked = crandas.stats.rankdata(df["a"], method="min")
>>> ranked.open().to_numpy()
array([ 1, 1, 3, 3, 3, 6, 6, 8, 8, 10, 11, 11, 13])
# The maximum of the ranks of the tied values is assigned to each of the values
>>> ranked = crandas.stats.rankdata(df["a"], method="max")
>>> ranked.open().to_numpy()
array([ 2, 2, 5, 5, 5, 7, 7, 9, 9, 10, 12, 12, 13])
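The three tie-handling options can be sketched in plain Python on a public list. This is only an illustration of the ranking rule, not the crandas implementation, and the helper name `rank_values` is hypothetical.

```python
def rank_values(values, method="average"):
    """Rank values from 1 upwards; tied values share a rank chosen by `method`."""
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` to cover the whole run of tied values.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        lowest, highest = pos + 1, end + 1  # 1-based ranks spanned by the run
        if method == "average":
            rank = (lowest + highest) / 2
        elif method == "min":
            rank = lowest
        elif method == "max":
            rank = highest
        else:
            raise ValueError(f"unknown method: {method!r}")
        for k in range(pos, end + 1):
            ranks[order[k]] = rank
        pos = end + 1
    return ranks


data = [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10]
print(rank_values(data, method="min"))
# [1, 1, 3, 3, 3, 6, 6, 8, 8, 10, 11, 11, 13]
```

Running it with method="average" or method="max" gives the same rankings as the opened results above.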
Contingency tables
Another useful tool for data exploration is counting how many times
each pair of values appears in a table. For
this, we will use the
crosstab()
function. This function takes two columns from the same
table and counts the occurrences of each pair of values. The levels
parameter determines which values are taken into account, since we might
only care about some of the values rather than all of them. If levels are
not specified for a column, all of its values are used. The levels
parameter is required for the second column, but not for the first one.
import crandas.stats.contingency
df = cd.DataFrame(
{
"a": [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10],
"b": [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
}
)
# The tuple in `levels` includes a value for the first and second column
# Because we want all the values of the first one, we put `None`
# But for the second one, we need to add all values
cont_tab = crandas.stats.contingency.crosstab(df["a"], df["b"], levels=(None, [0, 1]))
The result of this function is based on SciPy's CrosstabResult. When
opening the result, we need to know what we are looking at. The pair of
lists under elements holds the values from the first and second
columns. They can be interpreted as the row and column labels
respectively. The table in count gives the number of occurrences of each
pair.
>>> cont_tab.open()
CrosstabResult(elements=([1, 3, 4, 5, 8, 9, 10], [0, 1]), count=
y0 y1
0 1 1
1 0 3
2 2 0
3 1 1
4 0 1
5 2 0
6 0 1
)
In the previous table, the count 3 is in the second row of the second
column. elements contains two lists, first the row labels and then the
column labels, so this cell corresponds to the second value in each of
the lists (3 and 1 respectively). Therefore, there are three instances of
the pair (3,1).
# If we change the value in `levels`, we will get only a subset of the values
>>> crandas.stats.contingency.crosstab(
df["a"], df["b"], levels=([1, 2, 3, 4, 5], [0, 1])
).open()
CrosstabResult(elements=([1, 2, 3, 4, 5], [0, 1]), count=
y0 y1
0 1 1
1 0 0
2 0 3
3 2 0
4 1 1
)
# We always need to add the `levels` value for the second column
>>> crandas.stats.contingency.crosstab(
df["a"], df["b"], levels=([1, 2, 3, 4, 5], None)
).open()
ValueError: The 'levels' argument needs to be set for column 'b'
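The counting rule behind the tables above can be sketched in plain Python on public lists. This is an illustration only, not crandas code; the helper name `crosstab_counts` is hypothetical. Unlike crandas, the sketch can infer the levels of the second column, because here the data is not hidden inside the engine.

```python
def crosstab_counts(first, second, levels=(None, None)):
    """Count how often each (row value, column value) pair occurs."""
    # Use the requested levels, or fall back to every distinct value.
    row_levels = sorted(set(first)) if levels[0] is None else list(levels[0])
    col_levels = sorted(set(second)) if levels[1] is None else list(levels[1])
    row_index = {value: i for i, value in enumerate(row_levels)}
    col_index = {value: j for j, value in enumerate(col_levels)}
    count = [[0] * len(col_levels) for _ in row_levels]
    for x, y in zip(first, second):
        # Pairs containing a value outside the chosen levels are ignored.
        if x in row_index and y in col_index:
            count[row_index[x]][col_index[y]] += 1
    return (row_levels, col_levels), count


a = [1, 1, 3, 3, 3, 4, 4, 5, 5, 8, 9, 9, 10]
b = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
elements, count = crosstab_counts(a, b, levels=([1, 2, 3, 4, 5], [0, 1]))
print(elements)  # ([1, 2, 3, 4, 5], [0, 1])
print(count)     # [[1, 1], [0, 0], [0, 3], [2, 0], [1, 1]]
```

The level 2 never occurs in the first column, so its row contains only zeros, just as in the crandas output.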
Contingency tables allow exploratory data analysis that is already useful while still respecting privacy.
Hypothesis testing
Crandas provides several hypothesis tests. These can be used for more complex data exploration, to check whether certain statistical properties hold in the data. Details on how to use them can be found in the API guide; here we list the four tests and what they do:
chisquare(): Pearson's Chi-Square test, tests whether it is likely that observed differences in categorical data arose by chance.
chi2_contingency(): Chi-square test of independence on a contingency table.
ttest_ind(): T-test, tests whether the difference between the responses of two groups is statistically significant.
kruskal(): Kruskal-Wallis test, tests whether the population medians of all of the groups are equal.
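As an illustration of what a chi-square test of independence measures, the following plain-Python sketch computes the test statistic and degrees of freedom for a small contingency table. It is not the crandas API: the helper name `chi2_statistic` and the example counts are hypothetical, no continuity correction is applied, and no p-value is computed. It also assumes that every row and column total is nonzero.

```python
def chi2_statistic(observed):
    """Chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical 2x2 table: two groups (rows) against two outcomes (columns).
table = [[10, 20], [20, 10]]
statistic, dof = chi2_statistic(table)
print(round(statistic, 4), dof)  # 6.6667 1
```

A larger statistic means the observed counts lie further from what independence would predict; the test then compares it against a chi-square distribution with dof degrees of freedom.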