crandas.stats

This module includes functionality for hypothesis testing (crandas.stats.hypothesis_testing), contingency tables (crandas.stats.contingency) and ranking (crandas.stats.ranking).

stats.hypothesis_testing

This module implements hypothesis testing for crandas. The API is based upon the SciPy.stats module, which is one of the most commonly used statistics packages in Python.

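Supported tests:

  • chi2_contingency(): Chi-square test of independence on a contingency table.
  • chisquare(): Pearson's chi-square goodness-of-fit test.
  • kruskal(): Kruskal-Wallis test for independent samples.
  • ttest_ind(): t-test for the means of two independent samples.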

For more information see the SciPy documentation.

Tip

The functions from this module are also available by directly importing crandas.stats, e.g., both as crandas.stats.hypothesis_testing.kruskal and as crandas.stats.kruskal.

Chi2ContingencyResult

Bases: namedtuple('Chi2ContingencyResult', ['statistic', 'pvalue', 'df'])

An object containing the attributes:

ATTRIBUTE DESCRIPTION
statistic

The chi-squared test statistic.

TYPE: float

pvalue

The p-value of the test.

TYPE: float

df

The number of degrees of freedom used for the Chi-Squared distribution.

TYPE: int

expected_freq

Expected frequencies for the input under the assumption of independence.

TYPE: DataFrame

__new__(statistic, pvalue, df, expected_freq)

Construct a new Chi2ContingencyResult object

Note: the expected_freq table is secret shared, and therefore not part of the namedtuple structure. Instead, it can be retrieved (and opened) manually from this structure by the user if desired. It is not opened by default since that might leak information.

PARAMETER DESCRIPTION
cls

Result object

statistic

The test statistic, public.

TYPE: float

pvalue

The p-value of the chi-squared test, public.

TYPE: float

df

Degrees of freedom used for the test, public

TYPE: float

expected_freq

Expected frequencies for the input (secret shared)

TYPE: DataFrame

chi2_contingency(observed, correction=True, lambda_=None, _allow_small=False)

Perform the Chi-square test of independence on variables in a contingency table

This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table. The expected frequencies are computed based on the marginal sums under the assumption of independence.

This test requires that both the observed and expected frequencies in each category are at least 5. If this requirement is not satisfied, the Chi-square test is not appropriate and a different test should be used instead. To still allow smaller values, set _allow_small=True, which lowers the minimum cell value to 1.

This is the crandas variant of the related SciPy function scipy.stats.chi2_contingency().
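As an illustration of the computation described above, here is a minimal plaintext sketch in pure Python. It is not the crandas implementation, which operates on secret-shared data; the helper name chi2_independence is hypothetical. It derives the expected frequencies from the marginal sums, accumulates the Pearson statistic, and uses the closed form of the chi-squared survival function for 2 degrees of freedom.

```python
import math


def chi2_independence(observed):
    """Pearson chi-square statistic for a 2-D table given as a list of rows.

    Hypothetical plaintext helper, for illustration only.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / grand total.
            exp = row_sums[i] * col_sums[j] / total
            statistic += (obs - exp) ** 2 / exp
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, df


statistic, df = chi2_independence([[10, 10, 20], [20, 20, 20]])
# For df == 2 the chi-squared survival function has the closed form exp(-x / 2).
pvalue = math.exp(-statistic / 2)
print(statistic, df, pvalue)
```

The plaintext statistic is 25/9 (about 2.7778). The crandas result documented for the same table differs slightly in the later digits, presumably because crandas computes with fixed-point numbers.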

Example

import crandas
import crandas.stats
observed = crandas.DataFrame([[10, 10, 20],[20, 20, 20]], columns=['A', 'B', 'C'], auto_bounds=True)
res = crandas.stats.chi2_contingency(observed)
>>> res
Chi2ContingencyResult(statistic=2.7789535522460938, pvalue=0.24920566087798726, df=2)

The expected frequencies can be accessed through the expected_freq field in the output structure:

>>> res.expected_freq.open().to_numpy()
array([[12., 12., 16.], [18., 18., 24.]])

The input contingency table can be computed with crosstab():

>>> cdf = crandas.DataFrame({"x": [1, 0, 1, 0, 1]*10, "y": [0, 1, 1, 0, 1]*10}, auto_bounds=True)
>>> ctab = crandas.stats.contingency.crosstab(cdf["x"], cdf["y"], levels=(None, [0,1]))
>>> crandas.stats.chi2_contingency(ctab.count)
Chi2ContingencyResult(statistic=0.78125, pvalue=0.3767591178115821, df=1)
PARAMETER DESCRIPTION
observed

The contingency table. The table contains the observed frequencies (i.e. number of occurrences) in each category. The maximum value for frequencies is around 500000. In case of numeric overflow errors, explicitly set a maximum, e.g. using .astype("int[min=0,max=500000]").

TYPE: DataFrame

correction

If True and the number of degrees of freedom is 1, apply Yates' correction for continuity. The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value.

TYPE: bool DEFAULT: True
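The continuity correction described for the correction parameter can be sketched in plain Python as follows. This is a plaintext illustration under the assumption that the correction behaves as in SciPy (each observed value moves towards its expected value by at most 0.5). The helper name yates_chi2 is hypothetical and not part of crandas.

```python
import math


def yates_chi2(observed):
    """Chi-square statistic for a 2x2 table with Yates' continuity correction.

    Hypothetical plaintext helper, for illustration only.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / total
            diff = exp - obs
            # Move the observed value towards the expected one by at most 0.5.
            step = min(0.5, abs(diff))
            corrected = obs + math.copysign(step, diff)
            statistic += (corrected - exp) ** 2 / exp
    return statistic


# Counts for x = [1, 0, 1, 0, 1] * 10 against y = [0, 1, 1, 0, 1] * 10
# (rows: x = 0, 1; columns: y = 0, 1).
table = [[10, 10], [10, 20]]
stat = yates_chi2(table)
# For df == 1 the chi-squared survival function equals erfc(sqrt(x / 2)).
pvalue = math.erfc(math.sqrt(stat / 2))
print(stat, pvalue)
```

For this table the corrected statistic is 0.78125, which matches the value shown in the crosstab() example.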

lambda_

Currently not supported. Needs to be left at None.

TYPE: `None` DEFAULT: None

RETURNS DESCRIPTION
Chi2ContingencyResult

chisquare(f_obs, f_exp=None, ddof=0, axis=0, _allow_small=False)

Calculate Pearson's Chi-Square test.

The chi-square test tests the null hypothesis that the categorical data in f_obs has the given frequencies of f_exp.

This function outputs the Pearson test statistic, as well as the associated p-value for a Chi-Squared distribution with k - 1 - ddof degrees of freedom, where k = len(f_obs) denotes the number of distinct categories in the input sample. Both of these values are returned as public numbers.

f_obs is a secret-shared column containing the observations. f_exp can either be a secret-shared column or an array-like structure (e.g. np.array) containing the expected occurrences (which are fixed point numbers) for each category. It can also be left to None, in which case all categories are assumed to be equally likely, as in SciPy. If specified, f_exp must have the same size as f_obs.

This test requires that both the observed and expected frequencies in each category are at least 5. If this requirement is not satisfied, the Chi-square test is not appropriate and a different test should be used instead. To still allow smaller values, set _allow_small=True, which lowers the minimum cell value to 1.

This is the crandas variant of the related SciPy function scipy.stats.chisquare().
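To make the statistic and the degrees of freedom concrete, the following pure-Python sketch computes them on plaintext lists. It is only an illustration of the formula described above, not the crandas implementation; the helper name pearson_chisquare is hypothetical.

```python
def pearson_chisquare(f_obs, f_exp=None, ddof=0):
    """Pearson statistic and degrees of freedom for plaintext frequencies.

    Hypothetical helper, for illustration only.
    """
    k = len(f_obs)
    if f_exp is None:
        # All categories equally likely: spread the total evenly.
        f_exp = [sum(f_obs) / k] * k
    if len(f_exp) != k:
        raise ValueError("f_exp must have the same size as f_obs")
    statistic = sum((o - e) ** 2 / e for o, e in zip(f_obs, f_exp))
    return statistic, k - 1 - ddof


stat, df = pearson_chisquare([43, 52, 54, 40], [83.16, 45.36, 54.81, 5.67])
uniform_stat, uniform_df = pearson_chisquare([43, 52, 54, 40])
print(stat, df)
print(uniform_stat, uniform_df)
```

With the explicit expected frequencies the statistic is roughly 228.235 with 3 degrees of freedom. Small deviations from a crandas result for the same input are to be expected, since crandas works with fixed-point numbers.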

Example

import crandas as cd
import crandas.stats
f_obs = cd.DataFrame({"x": [43, 52, 54, 40]}, ctype={"x": "fp32"})
f_exp = [83.16, 45.36, 54.81, 5.67]
>>> crandas.stats.chisquare(f_obs["x"], f_exp)
ChiSquareResult(statistic=228.23486995697021, pvalue=3.3300384611141438e-49, df=3)
PARAMETER DESCRIPTION
f_obs

Observed frequencies in each category. Should be non-nullable.

TYPE: CSeries

f_exp

Expected number of occurrences in each category. If None, the categories are assumed to be equally likely. Should not contain any nulls.

TYPE: CSeries or array-like DEFAULT: None

ddof

"Delta degrees of freedom": adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies.

TYPE: int DEFAULT: 0

axis

Currently not supported. Needs to be left at 0.

TYPE: int or None DEFAULT: 0

RETURNS DESCRIPTION
ChiSquareResult

A namedtuple object containing the attributes:

  • statistic (float) - The chi-squared test statistic.
  • pvalue (float) - The p-value of the test.
  • df (int) - The number of degrees of freedom used for the Chi-Squared distribution.

kruskal(*samples, nan_policy='raise')

Compute the Kruskal-Wallis hypothesis test for independent samples.

The Kruskal-Wallis hypothesis test tests the null hypothesis that the population medians of all groups are equal. The test works on 2 or more independent samples, which may have different sizes.

This is the crandas variant of the related SciPy function scipy.stats.kruskal().

Note: the input samples are required to contain at least 2 distinct values, otherwise the test is not defined.
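For reference, the tie-corrected H statistic can be reproduced on plaintext data with the standard Kruskal-Wallis formula. The sketch below is an illustration in pure Python, not the crandas implementation; the helper names average_ranks and kruskal_h are hypothetical.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values starting at 1; tied values share the average of their ranks."""
    ranks = []
    for v in values:
        less = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        ranks.append(less + (equal + 1) / 2)
    return ranks


def kruskal_h(*samples):
    """Tie-corrected Kruskal-Wallis H statistic and its degrees of freedom."""
    pooled = [v for sample in samples for v in sample]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for sample in samples:
        rank_sum = sum(ranks[start:start + len(sample)])
        weighted += rank_sum ** 2 / len(sample)
        start += len(sample)
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over groups of equal values.
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h / correction, len(samples) - 1


x = [7, 3, 3.2, 1, 5]
y = [3, -5.8, 11.3]
z = [-5.2, 3, 2.1, 6.2, 3.7]
h, df = kruskal_h(x, y, z)
# For df == 2 the chi-squared survival function has the closed form exp(-x / 2).
pvalue = math.exp(-h / 2)
print(h, df, pvalue)
```

If all values are identical the correction factor becomes 0 and the division fails, which is consistent with the requirement of at least 2 distinct values.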

Example

import crandas as cd
import crandas.stats
x = cd.DataFrame({"x": [7, 3, 3.2, 1, 5]}, ctype={"x": "fp40"})['x']
y = cd.DataFrame({"x": [3, -5.8, 11.3]}, ctype={"x": "fp40"})['x']
z = cd.DataFrame({"x": [-5.2, 3, 2.1, 6.2, 3.7]}, ctype={"x": "fp40"})['x']
>>> cd.stats.kruskal(x, y, z)
KruskalResult(statistic=0.35555337549565913, pvalue=0.8371293438563217, df=2)
PARAMETER DESCRIPTION
*samples

Two or more arrays with the sample measurements can be given as arguments. Samples must be one-dimensional, non-empty and non-nullable.

TYPE: CSeries DEFAULT: ()

nan_policy

Defines how to handle input NaNs. Currently, only the default 'raise' is supported. Note that this means that all samples should be non-nullable columns.

TYPE: str DEFAULT: 'raise'

RETURNS DESCRIPTION
KruskalResult

A namedtuple object containing the attributes:

  • statistic (float) - The Kruskal-Wallis H statistic, corrected for ties.
  • pvalue (float) - The p-value of the test.
  • df (int) - The number of degrees of freedom used for the Chi-Squared distribution.

ttest_ind(a, b, equal_var=True, nan_policy='raise', permutations=None, random_state=None, alternative='two-sided', trim=0, hide_variances=False)

Calculate the T-test for the means of two independent samples of scores.

This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default. The test for non-equal variances is also supported (this is Welch's t-test), which can be done by setting equal_var=False. The samples are also permitted to have distinct sizes.

By default, the p-value is determined by comparing the t-statistic of the observed data against a Student t-distribution.

This function outputs the t-test statistic, as well as the associated p-value for a t-distribution with the appropriate degrees of freedom, which is also included in the output. These values are returned as public numbers.

This is the crandas variant of the related SciPy function scipy.stats.ttest_ind().
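The difference between the pooled-variance test and Welch's test lies in the standard error and the degrees of freedom. The following pure-Python sketch shows both on plaintext samples using the textbook formulas. It is an illustration only, and the helper name t_statistic is hypothetical rather than part of crandas.

```python
import math
from statistics import mean, variance


def t_statistic(a, b, equal_var=True):
    """t-statistic and degrees of freedom for two independent samples.

    Hypothetical plaintext helper, for illustration only.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 in the denominator)
    if equal_var:
        # Student's t-test: pool the two variances.
        df = na + nb - 2
        pooled = ((na - 1) * va + (nb - 1) * vb) / df
        se = math.sqrt(pooled * (1 / na + 1 / nb))
    else:
        # Welch's t-test: Welch-Satterthwaite degrees of freedom.
        sa, sb = va / na, vb / nb
        df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
        se = math.sqrt(sa + sb)
    return (mean(a) - mean(b)) / se, df


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_pooled, df_pooled = t_statistic(a, b)
t_welch, df_welch = t_statistic(a, b, equal_var=False)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
```

Because both samples have the same size here, the two statistics coincide and only the degrees of freedom differ (8 versus roughly 5.88).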

Example

Test with two samples with identical mean:

import scipy.stats
import crandas as cd
import crandas.stats
rvs1 = scipy.stats.norm.rvs(loc=2, scale=4, size=1000)
rvs2 = scipy.stats.norm.rvs(loc=2, scale=4, size=1000)
a = cd.DataFrame({"x": rvs1}, ctype={"x": "fp32"})['x']
b = cd.DataFrame({"x": rvs2}, ctype={"x": "fp32"})['x']
res = crandas.stats.ttest_ind(a, b)
>>> res
TtestResult(statistic=0.4355583190917969, pvalue=0.6632514017563175, df=998)

As expected, this results in a high p-value. In addition, it is also possible to determine the confidence interval for mean(a) - mean(b):

>>> res.confidence_interval(0.95)
ConfidenceInterval(low=-0.9779958724975586, high=1.5385313034057617)

Test with two samples with identical mean but unequal variance:

rvs1 = scipy.stats.norm.rvs(loc=1, scale=7, size=1000)
rvs2 = scipy.stats.norm.rvs(loc=1, scale=2, size=1000)
a = cd.DataFrame({"x": rvs1}, ctype={"x": "fp[min=-127,max=127]"})['x']
b = cd.DataFrame({"x": rvs2}, ctype={"x": "fp[min=-127,max=127]"})['x']
>>> crandas.stats.ttest_ind(a, b, equal_var=False)
TtestResult(statistic=-0.271575927734375, pvalue=0.7859958130994953, df=1175.0956583023071)

Note

The t-test for unequal variances requires tighter bounds (min, max) on the input values, to avoid an internal numeric overflow.

PARAMETER DESCRIPTION
a

First sample of numerical values. Should be non-nullable and have at least 2 elements.

TYPE: CSeries

b

Second sample of numerical values. Should be non-nullable and have at least 2 elements.

TYPE: CSeries

equal_var

If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch's t-test, which does not assume equal population variance.

TYPE: bool DEFAULT: True

nan_policy

Defines how to handle input NaNs. Currently, only the default 'raise' is supported. Note that this means that both a and b should be non-nullable columns.

TYPE: str DEFAULT: 'raise'

permutations

Currently not supported. Needs to be left at None.

TYPE: None DEFAULT: None

random_state

Currently not supported. Needs to be left at None.

TYPE: None DEFAULT: None

alternative

Defines the alternative hypothesis. The following options are available (default is two-sided):

  • two-sided: the means of the distributions underlying the samples are unequal.
  • less: the mean of the distribution underlying sample a is less than the mean of the distribution underlying sample b.
  • greater: the mean of the distribution underlying sample a is greater than the mean of the distribution underlying sample b.

TYPE: `two-sided`, `less`, `greater` DEFAULT: `two-sided`

trim

Currently not supported. Needs to be left at 0.

TYPE: float DEFAULT: 0

hide_variances

Flag specifying whether the sample variances should remain secret. If set to True, less information is leaked, but there are tighter constraints on the size and bounds of the input samples.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
TtestResult

An object containing the attributes and methods:

  • statistic (float) - The t-statistic.
  • pvalue (float) - The p-value associated with the given alternative.
  • df (float) - Degrees of freedom used for the t-distribution.
  • var_a (float) - If hide_variances=False, then the variance of the first input sample 'a'. Otherwise math.nan.
  • var_b (float) - If hide_variances=False, then the variance of the second input sample 'b'. Otherwise math.nan.
  • confidence_interval(confidence_level=0.95) : Computes a confidence interval around the difference in population means (mean(a) - mean(b)) for the given confidence level. The confidence interval is returned in a namedtuple with fields low and high.

stats.contingency

Functions for creating and analyzing contingency tables.

Supported functionality:

  • crosstab(): Compute contingency table. Currently, only two-dimensional tables, for two columns of the same table, are supported; and the set of possible values of the second column needs to be explicitly specified.
  • expected_freq(): Compute the expected frequencies from a contingency table.

CrosstabResult(elements, count, first_col)

Represents a return value of crandas.stats.contingency.crosstab.

The elements field is a 2-tuple representing the values in the first series (as a crandas.crandas.ReturnValue) and the values in the second series (as a pandas Series).

The count field is a DataFrame representing the counts, where the rows correspond to the values in the first series and the columns correspond to the values in the second series. The columns have column names y0, y1, etc.

as_table()

Return DataFrame containing self.elements[0] and self.count

The first column corresponds to self.elements[0] and has the column name of the original first input column. The remaining columns correspond to the respective counts for the original second input column. The first column name cannot be a value of the second column.

open()

Open contingency table

The elements field is a 2-tuple representing the values in the first and second series (as Pandas Series).

The count field is a pandas DataFrame representing the counts.

crosstab(col1, col2, levels)

Compute contingency table

See the SciPy documentation for more information.

Returns table of counts for combinations of values of the two provided series.

The arguments must be of type CSeriesColRef and must be from the same table, e.g., cdf["name1"], cdf["name2"].

For the second column, the possible values must be specified using the levels argument. This list of values can be computed by crandas, e.g., list(cdf.groupby("name2").as_table().open()["name2"]). Note that, when this function is used in a recorded script, the list of values must be the same as when the script was recorded.
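The role of the levels argument can be illustrated on plaintext lists. The sketch below is a pure-Python approximation, not the crandas implementation. The helper name crosstab_counts is hypothetical, and the sorted row order is an assumption made for readability. It counts the pairs of values, ignores second-column values outside the given levels, and labels the count columns y0, y1, and so on.

```python
from collections import Counter


def crosstab_counts(col1, col2, levels2):
    """Count (col1, col2) pairs; col2 values outside levels2 are ignored.

    Hypothetical plaintext helper, for illustration only.
    """
    pairs = Counter((u, v) for u, v in zip(col1, col2) if v in levels2)
    rows = sorted(set(col1))  # assumption: rows listed in sorted order
    header = [f"y{j}" for j in range(len(levels2))]
    table = {u: [pairs[(u, v)] for v in levels2] for u in rows}
    return header, table


header, table = crosstab_counts([1, 0, 0], [True, False, True], [False, True])
print(header)
print(table)
```

Each key of table is a value of the first column, and each list holds the counts for the second-column levels in the order they were passed.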

Example

import crandas as cd
from crandas.stats.contingency import crosstab
cdf = cd.DataFrame(
    {"a": [1, 0, 0], "b": [True, False, True], "c": ["any", "value", "will do"]},
    auto_bounds=True,
)
res = crosstab(cdf["a"], cdf["b"], levels=(None, [False, True]))
PARAMETER DESCRIPTION
col1

Columns whose unique combinations of values are to be counted. Must be columns of the same table. Bytes columns are currently not supported.

TYPE: CSeriesColRef

col2

Columns whose unique combinations of values are to be counted. Must be columns of the same table. Bytes columns are currently not supported.

TYPE: CSeriesColRef

levels

A two-tuple of values of the columns that are to be counted (ignoring all other values). For the first column, if None is passed, all values are counted.

RETURNS DESCRIPTION
CrosstabResult

Contingency table

expected_freq(observed, *, _allow_small=False)

Compute the expected frequencies from a contingency table.

Given an n-dimensional contingency table of observed frequencies, compute the expected frequencies for the table based on the marginal sums under the assumption that the groups associated with each dimension are independent.

This is the crandas variant of the related SciPy function scipy.stats.contingency.expected_freq().
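For a two-dimensional table, the expected frequency of a cell is its row total times its column total, divided by the grand total. A minimal plaintext sketch of that rule in pure Python is shown here. It is an illustration only, restricted to two dimensions, and the helper name expected_from_marginals is hypothetical.

```python
def expected_from_marginals(observed):
    """Expected frequencies of a 2-D table under independence.

    Hypothetical plaintext helper, for illustration only.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


expected = expected_from_marginals([[10, 10, 20], [20, 20, 20]])
print(expected)
```

The grand total used here is the quantity that the crandas function opens.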

Note

This function opens the total number of observations. That is, the sum of all entries in observed. This information is necessary for the computation of the expected frequencies, and is normally already public information in any case.

Tip

By default, this function requires that all cells in observed are at least 5. To allow smaller values, set _allow_small=True, which lowers the minimum cell value to 1. Zeroes are explicitly not supported.
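The cell-size rule above amounts to a simple lower bound on every entry. As a plaintext illustration only (crandas enforces this on secret-shared data, and the helper name check_cells is hypothetical), the rule could be expressed like this:

```python
def check_cells(observed, allow_small=False):
    """Raise ValueError if any cell is below the permitted minimum.

    Hypothetical plaintext helper, for illustration only.
    """
    minimum = 1 if allow_small else 5
    for row in observed:
        for cell in row:
            if cell < minimum:
                raise ValueError(
                    f"cell value {cell} is below the minimum of {minimum}"
                )


check_cells([[10, 10, 20], [20, 20, 20]])      # passes: all cells >= 5
check_cells([[3, 10], [7, 2]], allow_small=True)  # passes: all cells >= 1
```

A table containing a zero is rejected in both modes.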

Example

import crandas as cd
from crandas.stats.contingency import expected_freq
observed = cd.DataFrame([[10, 10, 20],[20, 20, 20]], columns=['A', 'B', 'C'])
expected = expected_freq(observed)
>>> expected.open().to_numpy()
array([ [12.00027466, 11.99913025, 16.00036621],
        [18.00041199, 18.00041199, 24.00054932]])
PARAMETER DESCRIPTION
observed

The table of observed frequencies. Must be non-empty, and all columns must be non-nullable.

TYPE: DataFrame

RETURNS DESCRIPTION
expected

The expected frequencies, based on the marginal sums of the table. Same shape as 'observed'.

TYPE: DataFrame

stats.ranking

Functions for determining the ranking of data and handling ties.

Supported functionality:

  • rankdata(): Compute the ranks of a list of numbers, handling ties properly.
  • tiecorrect(): Compute the tie correction factor as used in the Kruskal-Wallis hypothesis test.

rankdata(a, method='average')

Determine the rank of each element in 'a', dealing with ties appropriately.

Ranks start at 1. The method argument controls how ranks are assigned to equal values.

This is the crandas variant of the related SciPy function scipy.stats.rankdata().
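How the supported tie-handling methods differ is easiest to see on a plaintext list. The sketch below is a pure-Python illustration, not the crandas implementation; the helper name rank_plain is hypothetical. For each value it takes the lowest and highest rank that its group of equal values occupies and then applies the chosen method.

```python
def rank_plain(values, method="average"):
    """Rank values starting at 1, resolving ties with the given method.

    Hypothetical plaintext helper, for illustration only.
    """
    ranks = []
    for v in values:
        less = sum(1 for w in values if w < v)
        equal = sum(1 for w in values if w == v)
        lowest, highest = less + 1, less + equal
        if method == "average":
            ranks.append((lowest + highest) / 2)
        elif method == "min":
            ranks.append(lowest)
        elif method == "max":
            ranks.append(highest)
        else:
            raise ValueError(f"unsupported method: {method}")
    return ranks


data = [5, 3, 4, 5, 3, 8, 9, 10, 3, 9, 1, 4]
avg_ranks = rank_plain(data, "average")
min_ranks = rank_plain(data, "min")
max_ranks = rank_plain(data, "max")
print(avg_ranks)
print(min_ranks)
print(max_ranks)
```

The quadratic loop keeps the sketch short; it is not meant to reflect how ranks are computed on secret-shared data.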

Example

import crandas as cd
import crandas.stats
df = cd.DataFrame({"a": [5, 3, 4, 5, 3, 8, 9, 10, 3, 9, 1, 4]}, auto_bounds=True)
ranked = crandas.stats.rankdata(df["a"], method="average")
>>> ranked.open().to_numpy()
array([7.5, 3.0, 5.5, 7.5, 3.0, 9.0, 10.5, 12.0, 3.0, 10.5, 1.0, 5.5])
PARAMETER DESCRIPTION
a

Array of (numeric) values to be ranked. Should be non-nullable.

TYPE: CSeries

method

Defines how ranks should be assigned to tied elements. The following options are available (default is average):

  • average: The average of the ranks of the tied values is assigned to each of the values
  • min: The minimum of the ranks of the tied values is assigned to each of the values
  • max: The maximum of the ranks of the tied values is assigned to each of the values

Note that the options dense and ordinal, which are available in SciPy, are currently not supported by crandas.

TYPE: str DEFAULT: 'average'

RETURNS DESCRIPTION
rankvals

A CSeries object with the same size as 'a', containing the ranks assigned to each of the values. These are numbers between 1 and len(a), inclusive.

TYPE: CSeries

tiecorrect(rankvals)

Determine the tie-correction factor for the Kruskal-Wallis hypothesis test.

This factor is also applicable to other statistical tests, such as the Mann-Whitney U test.

This is the crandas variant of the related SciPy function scipy.stats.tiecorrect().

Note: this implementation requires that the input is non-empty, and that there are at least 2 distinct values. Otherwise, the factor is not defined (due to division by 0). This behaviour differs from that of SciPy, which does permit these cases, but immediately returns 1.0.
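The standard tie-correction factor is 1 - sum(t^3 - t) / (n^3 - n), where the sum runs over the sizes t of the groups of tied ranks and n is the number of ranks. The pure-Python sketch below evaluates it on plaintext ranks. It is an illustration that assumes this standard formula, and the helper name tie_correction is hypothetical.

```python
from collections import Counter


def tie_correction(rankvals):
    """Tie-correction factor for a plaintext list of ranks.

    Hypothetical helper, for illustration only.
    """
    n = len(rankvals)
    if n < 2:
        raise ValueError("at least two ranks are required")
    # Tied values share the same rank, so equal ranks identify tie groups.
    tie_sizes = Counter(rankvals).values()
    return 1.0 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)


ranks = [7.5, 3.0, 5.5, 7.5, 3.0, 9.0, 10.5, 12.0, 3.0, 10.5, 1.0, 5.5]
factor = tie_correction(ranks)
print(factor)
```

For these ranks the factor is 1 - 42/1716, roughly 0.97552.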

Example

import crandas as cd
import crandas.stats
df = cd.DataFrame({"a": [5, 3, 4, 5, 3, 8, 9, 10, 3, 9, 1, 4]}, auto_bounds=True)
ranks = crandas.stats.rankdata(df["a"])
ties = crandas.stats.tiecorrect(ranks)
>>> ties.open()
0.97552490234375
PARAMETER DESCRIPTION
rankvals

Array of (numeric) ranks. Should be non-nullable and non-empty. Also, it should contain at least 2 distinct values.

Typically this is the array returned by the function rankdata().

TYPE: CSeries

RETURNS DESCRIPTION
factor

The (secret-shared) correction factor. A number between 0.0 and 1.0.

TYPE: ReturnValue