crandas.experimental.stats
This module includes functionality for hypothesis testing (crandas.experimental.stats.hypothesis_testing) and contingency tables (crandas.experimental.stats.contingency).
NOTE: This module is currently experimental. For larger data quantities, the functions in this module can produce inaccurate results. In a future version, these limitations will be repaired and/or documented.
NOTE: The functions from crandas.experimental.stats.hypothesis_testing are also available by directly importing crandas.experimental.stats. For example, kruskal is available both as crandas.experimental.stats.hypothesis_testing.kruskal and as crandas.experimental.stats.kruskal.
stats.hypothesis_testing
This module implements hypothesis testing for crandas. The API is based upon the SciPy.stats module, which is one of the most commonly used statistics packages in Python.
Supported tests:
chisquare(): Pearson’s Chi-Square test.
ttest_ind(): T-test.
kruskal(): Kruskal-Wallis H-test.
See: https://docs.scipy.org/doc/scipy/reference/stats.html
(NOTE: these functions are available both as crandas.experimental.stats.<fn> and as crandas.experimental.stats.hypothesis_testing.<fn>.)
- crandas.experimental.stats.hypothesis_testing.chi2_contingency(observed, correction=True, lambda_=None, *, allow_small=False)
Perform the Chi-square test of independence on variables in a contingency table
This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table. The expected frequencies are computed based on the marginal sums under the assumption of independence.
- Note: the output structure differs slightly from that of the related SciPy function: the expected_freq field is omitted, because it could leak significant information if included. If this information is still desired, it can be computed using the crandas.experimental.stats.contingency.expected_freq() function.
This is the crandas variant of the related SciPy function scipy.stats.chi2_contingency().
Examples
>>> import crandas
>>> import crandas.experimental.stats
>>> observed = crandas.DataFrame([[10, 10, 20],[20, 20, 20]], columns=['A', 'B', 'C'])
>>> crandas.experimental.stats.chi2_contingency(observed)
Chi2ContingencyResult(statistic=2.7789535522460938, pvalue=0.24920566087798726, df=2)
The input contingency table can be computed with crandas.experimental.stats.contingency.crosstab():
>>> import numpy as np
>>> import pandas as pd
>>> import crandas as cd
>>> import crandas.experimental.stats.contingency
>>> data = np.random.randint(2, size=(5, 3))
>>> data = pd.DataFrame(data, columns=['x', 'y', 'z'])
>>> cdf = cd.upload_pandas_dataframe(data, auto_bounds=True)
>>> ctab = crandas.experimental.stats.contingency.crosstab(cdf["x"], cdf["y"], levels=(None, [0,1]))
>>> crandas.experimental.stats.chi2_contingency(ctab.count)
Chi2ContingencyResult(statistic=0.3125009536743164, pvalue=0.5761495398922512, df=1)
- Parameters:
observed (CDataFrame) – The contingency table. The table contains the observed frequencies (i.e. number of occurrences) in each category. The maximum value for frequencies is around 500000. In case of numeric overflow errors, explicitly set a maximum, e.g. using .astype("int[min=0,max=500000]").
correction (bool, optional) – If True, and the degrees of freedom is 1, apply Yates correction for continuity. The effect of the correction is to adjust each observed value by 0.5 towards the corresponding expected value.
lambda_ (None) – Currently not supported. Needs to be left at None.
allow_small (bool, default: False) – Allow small number of observations (smaller than 5). Note that a minimum of 5 is an often quoted guideline for the application of this test. Passing allow_small=True allows smaller values but results in numeric overflows for large input frequencies.
- Returns:
A namedtuple object containing the attributes:
statistic (float) - The chi-squared test statistic.
pvalue (float) - The p-value of the test.
df (float) - The number of degrees of freedom used for the Chi-Squared distribution.
- Return type:
Chi2ContingencyResult
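The computation described above (expected frequencies from the marginal sums, the chi-square statistic, and Yates' correction when there is one degree of freedom) can be sketched on plaintext data. This is a minimal plain-Python illustration, not the crandas implementation: the helper name chi2_independence is hypothetical, and it works on ordinary lists rather than on a secret-shared CDataFrame.

```python
import math


def chi2_independence(observed, correction=True):
    """Plaintext sketch: chi-square statistic and degrees of freedom for a 2-D table.

    Hypothetical helper for illustration only; not part of crandas.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    df = (len(row_sums) - 1) * (len(col_sums) - 1)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence: product of the marginals over the total.
            exp = row_sums[i] * col_sums[j] / total
            diff = obs - exp
            if correction and df == 1:
                # Yates' correction: move each observed value by up to 0.5
                # towards the corresponding expected value.
                diff = math.copysign(max(abs(diff) - 0.5, 0.0), diff)
            statistic += diff * diff / exp
    return statistic, df


statistic, df = chi2_independence([[10, 10, 20], [20, 20, 20]])
# For df == 2 the chi-squared survival function has the closed form exp(-x / 2).
pvalue = math.exp(-statistic / 2)
print(statistic, pvalue, df)
```

On the table from the first example, the exact statistic is 25/9 (about 2.7778); the value reported by crandas above differs slightly because the secure computation uses fixed-point approximations.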
- crandas.experimental.stats.hypothesis_testing.chisquare(f_obs, f_exp=None, ddof=0, axis=0, **query_args)
Calculate Pearson’s Chi-Square test.
The chi-square test tests the null hypothesis that the categorical data in f_obs has the frequencies given in f_exp.
This function outputs the Pearson test statistic, as well as the associated p-value for a Chi-Squared distribution with k - 1 - ddof degrees of freedom, where k = len(f_obs) denotes the number of distinct categories in the input sample. Both of these values are returned as public numbers.
f_obs is a secret-shared column containing the observations. f_exp can either be a secret-shared column or an array-like structure (e.g. np.array) containing the expected occurrences (which are fixed point numbers) for each category. It can also be left at None, in which case all categories are assumed to be equally likely, as in SciPy. If specified, f_exp must have the same size as f_obs.
This is the crandas variant of the related SciPy function scipy.stats.chisquare().
Examples
>>> import crandas as cd
>>> import crandas.experimental.stats
>>> f_obs = cd.DataFrame({"x": [43, 52, 54, 40]}, ctype={"x": "fp32"})['x']
>>> f_exp = [83.16, 45.36, 54.81, 5.67]
>>> crandas.experimental.stats.chisquare(f_obs, f_exp)
ChiSquareResult(statistic=228.23486995697021, pvalue=3.3300384611141438e-49, df=3)
- Parameters:
f_obs (CSeries) – Observed frequencies in each category.
f_exp (CSeries or array-like) – Expected number of occurrences in each category. If None, the categories are assumed to be equally likely.
ddof (int, optional) – “Delta degrees of freedom”: adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. The default value of ddof is 0.
axis (int or None, optional) – Currently not supported. Needs to be left at 0.
query_args – See Query Arguments
- Returns:
A namedtuple object containing the attributes:
statistic (float) - The chi-squared test statistic.
pvalue (float) - The p-value of the test.
df (float) - The number of degrees of freedom used for the Chi-Squared distribution
- Return type:
ChiSquareResult
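As a plaintext illustration of what is being computed, the sketch below evaluates the Pearson statistic and the k - 1 - ddof degrees of freedom on ordinary Python lists, including the default of equally likely categories when f_exp is None. The helper name pearson_chisquare is hypothetical and this is not the crandas implementation, which operates on secret-shared columns.

```python
def pearson_chisquare(f_obs, f_exp=None, ddof=0):
    """Plaintext sketch of Pearson's chi-square statistic (hypothetical helper)."""
    k = len(f_obs)
    if f_exp is None:
        # Default: all categories are assumed to be equally likely.
        f_exp = [sum(f_obs) / k] * k
    if len(f_exp) != k:
        raise ValueError("f_exp must have the same size as f_obs")
    # Sum of squared deviations, each scaled by the expected count.
    statistic = sum((o - e) ** 2 / e for o, e in zip(f_obs, f_exp))
    return statistic, k - 1 - ddof


statistic, df = pearson_chisquare([43, 52, 54, 40], [83.16, 45.36, 54.81, 5.67])
print(statistic, df)
```

With the inputs from the example above this gives a statistic of roughly 228.235 with 3 degrees of freedom, which agrees with the crandas output up to fixed-point rounding.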
- crandas.experimental.stats.hypothesis_testing.kruskal(*samples, nan_policy='raise', axis=0, keepdims=False, allow_duplicates=False, **query_args)
Compute the Kruskal-Wallis H-test for independent samples.
The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all of the groups are equal. The test works on 2 or more independent samples, which may have different sizes. Very large numbers of samples (e.g., above 300000 samples) are currently not supported.
This implementation of the Kruskal-Wallis test does not take into account tied values present in the samples; see allow_duplicates parameter.
This is the crandas variant of the related SciPy function scipy.stats.kruskal().
Examples
>>> import crandas as cd
>>> import crandas.experimental.stats
>>> x = cd.DataFrame({"x": [7, 3, 3.2, 1, 5]}, ctype={"x": "fp40"})['x']
>>> y = cd.DataFrame({"x": [3.1, -5.8, 11.3]}, ctype={"x": "fp40"})['x']
>>> z = cd.DataFrame({"x": [-5.2, 2.9, 2.1, 6.2, 3.7]}, ctype={"x": "fp40"})['x']
>>> crandas.experimental.stats.kruskal(x, y, z)
KruskalResult(statistic=0.4219780219780205, pvalue=0.8097829655462903, df=2)
- Parameters:
sample1 (CSeries) – Two or more arrays with the sample measurements can be given as arguments. Samples must be one-dimensional.
sample2 (CSeries) – Two or more arrays with the sample measurements can be given as arguments. Samples must be one-dimensional.
… (CSeries) – Two or more arrays with the sample measurements can be given as arguments. Samples must be one-dimensional.
nan_policy (str) – Defines how to handle input NaNs. Currently, only the default ‘raise’ is supported. Note that this means that all samples should be non-nullable columns.
axis (int or None, optional) – Currently not supported. Needs to be left at 0.
keepdims (bool) – Currently not supported. Needs to be left at False.
allow_duplicates (bool, default: False) – Indicates whether to allow duplicates. If False, the function checks for duplicate inputs and raises an error when they are found. If True, this check is skipped, and duplicate values are ignored when computing the test statistic. This may lead to inaccurate results, especially for large duplicate counts and/or small datasets.
query_args – See Query Arguments
- Returns:
A namedtuple object containing the attributes:
statistic (float) - The Kruskal-Wallis H statistic. Unlike in SciPy, it is not corrected for ties (see the allow_duplicates parameter).
pvalue (float) - The p-value of the test.
df (float) - The number of degrees of freedom used for the Chi-Squared distribution
- Return type:
KruskalResult
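For reference, the H statistic without tie correction can be sketched in plain Python as follows. The helper name kruskal_h is hypothetical, it works on ordinary lists rather than on secret-shared columns, and it simply refuses tied values, which mirrors the default allow_duplicates=False but says nothing about how crandas treats duplicates internally.

```python
import math


def kruskal_h(*samples):
    """Plaintext sketch of the Kruskal-Wallis H statistic without tie correction.

    Hypothetical helper for illustration only; not part of crandas.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    pooled = sorted(value for sample in samples for value in sample)
    if len(set(pooled)) != len(pooled):
        raise ValueError("tied values are not handled in this sketch")
    # Rank 1 goes to the smallest value in the pooled data.
    rank = {value: position for position, value in enumerate(pooled, start=1)}
    n_total = len(pooled)
    # Squared rank sum of each sample, divided by that sample's size.
    weighted = sum(sum(rank[v] for v in sample) ** 2 / len(sample) for sample in samples)
    statistic = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return statistic, len(samples) - 1


x = [7, 3, 3.2, 1, 5]
y = [3.1, -5.8, 11.3]
z = [-5.2, 2.9, 2.1, 6.2, 3.7]
statistic, df = kruskal_h(x, y, z)
# For df == 2 the chi-squared survival function has the closed form exp(-x / 2).
pvalue = math.exp(-statistic / 2)
print(statistic, pvalue, df)
```

The three samples are the ones from the example above, so the printed statistic and p-value can be compared directly with the KruskalResult shown there.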
- crandas.experimental.stats.hypothesis_testing.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='raise', permutations=None, random_state=None, alternative='two-sided', trim=0, keepdims=False, hide_variances=False, **query_args)
Calculate the T-test for the means of two independent samples of scores.
This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. By default, this test assumes that the populations have identical variances. The test for non-equal variances (Welch’s t-test) is also supported and can be selected by setting equal_var=False. The samples are also permitted to have distinct sizes.
By default, the p-value is determined by comparing the t-statistic of the observed data against a Student t-distribution.
This function outputs the t-test statistic, as well as the associated p-value for a t-distribution with the appropriate degrees of freedom, which is also included in the output. These values are returned as public numbers.
This is the crandas variant of the related SciPy function scipy.stats.ttest_ind().
Examples
Test with two samples with identical mean:
>>> import scipy.stats
>>> import crandas as cd
>>> import crandas.experimental.stats
>>> rvs1 = scipy.stats.norm.rvs(loc=2, scale=4, size=1000)
>>> rvs2 = scipy.stats.norm.rvs(loc=2, scale=4, size=1000)
>>> a = cd.DataFrame({"x": rvs1}, ctype={"x": "fp32"})['x']
>>> b = cd.DataFrame({"x": rvs2}, ctype={"x": "fp32"})['x']
>>> res = crandas.experimental.stats.ttest_ind(a, b)
>>> res
TtestResult(statistic=0.4355583190917969, pvalue=0.6632514017563175, df=998)
As expected, this results in a high p-value. In addition, it is also possible to determine the confidence interval for mean(a) - mean(b):
>>> res.confidence_interval(0.95)
ConfidenceInterval(low=-0.9779958724975586, high=1.5385313034057617)
Test with two samples with identical mean but unequal variance:
>>> rvs1 = scipy.stats.norm.rvs(loc=1, scale=7, size=1000)
>>> rvs2 = scipy.stats.norm.rvs(loc=1, scale=2, size=1000)
>>> a = cd.DataFrame({"x": rvs1}, ctype={"x": "fp[min=-127,max=127]"})['x']
>>> b = cd.DataFrame({"x": rvs2}, ctype={"x": "fp[min=-127,max=127]"})['x']
>>> crandas.experimental.stats.ttest_ind(a, b, equal_var=False)
TtestResult(statistic=-0.271575927734375, pvalue=0.7859958130994953, df=1175.0956583023071)
Note: the t-test for unequal variances requires tighter bounds (min, max) on the input values, to avoid an internal numeric overflow.
- Parameters:
a (CSeries) – First sample
b (CSeries) – Second sample
axis (int or None, optional) – Currently not supported. Needs to be left at 0.
equal_var (bool) – If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance.
nan_policy (str) – Defines how to handle input NaNs. Currently, only the default ‘raise’ is supported. Note that this means that both a and b should be non-nullable columns.
permutations (None) – Currently not supported. Needs to be left at None.
random_state (None) – Currently not supported. Needs to be left at None.
alternative ({two-sided, less, greater}, optional) – Defines the alternative hypothesis. The following options are available (default is two-sided):
two-sided: the means of the distributions underlying the samples are unequal.
less: the mean of the distribution underlying sample a is less than the mean of the distribution underlying sample b.
greater: the mean of the distribution underlying sample a is greater than the mean of the distribution underlying sample b.
trim (float) – Currently not supported. Needs to be left at 0.
keepdims (bool) – Currently not supported. Needs to be left at False.
hide_variances (bool) – Flag specifying whether the sample variances should remain secret. If set to True, less information is leaked, but there are tighter constraints on the size and bounds of the input samples.
query_args – See Query Arguments
- Returns:
An object containing the attributes and methods:
statistic (float) - The t-statistic.
pvalue (float) - The p-value associated with the given alternative.
df (float) - Degrees of freedom used for the t-distribution.
var_a (float) - If hide_variances=False, the variance of the first input sample ‘a’. Otherwise math.nan.
var_b (float) - If hide_variances=False, the variance of the second input sample ‘b’. Otherwise math.nan.
confidence_interval(confidence_level=0.95): Computes a confidence interval around the difference in population means (mean(a) - mean(b)) for the given confidence level. The confidence interval is returned in a namedtuple with fields low and high.
- Return type:
TtestResult
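The difference between the two variants selected by equal_var can be sketched on plaintext data using the textbook formulas: a pooled variance with n_a + n_b - 2 degrees of freedom for the standard test, and the Welch-Satterthwaite approximation for Welch’s t-test. The helper name ttest_ind_sketch and the sample values are hypothetical, and this is not the crandas implementation; it only shows the statistic and degrees of freedom, not the p-value or confidence interval.

```python
import math
from statistics import mean, variance


def ttest_ind_sketch(a, b, equal_var=True):
    """Plaintext sketch of the two-sample t-statistic and its degrees of freedom.

    Hypothetical helper for illustration only; not part of crandas.
    """
    n_a, n_b = len(a), len(b)
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1 denominator)
    if equal_var:
        # Standard test: pool the two variances, assuming they are equal.
        df = n_a + n_b - 2
        pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
        std_err = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    else:
        # Welch's t-test: keep the variances separate and use the
        # Welch-Satterthwaite approximation for the degrees of freedom.
        se_a, se_b = var_a / n_a, var_b / n_b
        df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
        std_err = math.sqrt(se_a + se_b)
    # The sign follows mean(a) - mean(b), as in the confidence interval above.
    return (mean(a) - mean(b)) / std_err, df


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_pooled, df_pooled = ttest_ind_sketch(a, b)
t_welch, df_welch = ttest_ind_sketch(a, b, equal_var=False)
print(t_pooled, df_pooled)
print(t_welch, df_welch)
```

Because both samples have the same size here, the two statistics coincide and only the degrees of freedom differ; with unequal sample sizes the statistics generally differ as well.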
stats.contingency
Functions for creating and analyzing contingency tables.
Supported functionality:
crosstab(): Compute contingency table. Currently, only two-dimensional tables, for two columns of the same table, are supported; and the set of possible values of the second column needs to be explicitly specified.
expected_freq(): Compute the expected frequencies from a contingency table.
- class crandas.experimental.stats.contingency.CrosstabResult(elements, count, first_col)
Bases: object
Represents a return value of crandas.experimental.stats.contingency.crosstab().
The elements field is a 2-tuple representing the values in the first series (as a crandas.crandas.ReturnValue) and the values in the second series (as a pandas Series).
The count field is a CDataFrame representing the counts, where the rows correspond to the values in the first series and the columns correspond to the values in the second series. The columns have column names y0, y1, etc.
- as_table()
Return CDataFrame containing self.elements[0] and self.count.
The first column corresponds to self.elements[0] and has the column name of the original first input column. The remaining columns correspond to the respective counts for the original second input column. The first column name cannot be a value of the second column.
- open()
Open contingency table.
The elements field is a 2-tuple representing the values in the first and second series (as pandas Series).
The count field is a pandas DataFrame representing the counts.
- crandas.experimental.stats.contingency.crosstab(col1, col2, levels)
Compute contingency table
See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.crosstab.html
Returns table of counts for combinations of values of the two provided series.
The arguments must be of type crandas.crandas.CSeriesColRef and must be from the same table, e.g., cdf["name1"], cdf["name2"].
For the second column, it is necessary to specify the possible values of the column using the levels argument. This list of values can be computed by crandas, e.g., list(cdf.groupby("name2").as_table().open()["name2"]); but note that, when using this function in a recorded script, the list of values needs to be the same as when recording the script.
Examples
>>> import crandas as cd
>>> from crandas.experimental.stats.contingency import crosstab
>>> cdf = cd.DataFrame(
...     {"a": [1, 0, 0], "b": [True, False, True], "c": ["any", "value", "will do"]},
...     auto_bounds=True,
... )
>>> res = crosstab(cdf["a"], cdf["b"], levels=(None, [False, True]))
- Parameters:
col1 (crandas.crandas.CSeriesColRef) – Columns whose unique combinations of values are to be counted. Must be columns of the same table. Bytes columns are currently not supported.
col2 (crandas.crandas.CSeriesColRef) – Columns whose unique combinations of values are to be counted. Must be columns of the same table. Bytes columns are currently not supported.
levels (sequence) – A two-tuple of values of the columns that are to be counted (ignoring all other values). For the first column, if None is passed, all values are counted.
- Returns:
Contingency table
- Return type:
CrosstabResult
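To make the role of the levels argument concrete, the following plain-Python sketch counts value combinations for two ordinary lists. The helper name crosstab_sketch is hypothetical, and the sorted order of the first column's values is an assumption made for this illustration; it is not the crandas implementation and says nothing about the row order crandas produces.

```python
from collections import Counter


def crosstab_sketch(col1, col2, levels):
    """Plaintext sketch of a two-dimensional contingency table (hypothetical helper)."""
    levels1, levels2 = levels
    if levels1 is None:
        # None for the first column: count every value that occurs.
        # Sorting is an assumption of this sketch, chosen for a stable row order.
        levels1 = sorted(set(col1))
    pairs = Counter(zip(col1, col2))
    # One row per value of the first column, one column per listed value of
    # the second column (named y0, y1, ... in the crandas result).
    # Values that are not listed in the levels are ignored.
    count = [[pairs[(v1, v2)] for v2 in levels2] for v1 in levels1]
    return (list(levels1), list(levels2)), count


elements, count = crosstab_sketch(
    [1, 0, 0], [True, False, True], levels=(None, [False, True])
)
print(elements)
print(count)
```

The input lists are the "a" and "b" columns from the example above: the row for a = 0 counts one False and one True, and the row for a = 1 counts a single True.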
- crandas.experimental.stats.contingency.expected_freq(observed, *, _allow_small=True)
Compute the expected frequencies from a contingency table.
Given an n-dimensional contingency table of observed frequencies, compute the expected frequencies for the table based on the marginal sums under the assumption that the groups associated with each dimension are independent.
This is the crandas variant of the related SciPy function scipy.stats.contingency.expected_freq(). See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.expected_freq.html#scipy.stats.contingency.expected_freq
- Note: this function opens the total number of observations, that is, the sum of all entries in ‘observed’. This information is necessary for the computation of the expected frequencies, and is normally already public information in any case.
Examples
>>> import crandas as cd
>>> from crandas.experimental.stats.contingency import expected_freq
>>> observed = cd.DataFrame([[10, 10, 20],[20, 20, 20]], columns=['A', 'B', 'C'])
>>> expected = expected_freq(observed)
>>> expected.open().to_numpy()
array([[12.00027466, 11.99913025, 16.00036621],
       [18.00041199, 18.00041199, 24.00054932]])
- Parameters:
observed (CDataFrame) – The table of observed frequencies.
- Returns:
expected – The expected frequencies, based on the marginal sums of the table. Same shape as ‘observed’.
- Return type:
CDataFrame
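For the two-dimensional case, the marginal-sum computation can be sketched in plain Python as below. The helper name expected_freq_sketch is hypothetical and this is not the crandas implementation; it is meant only to show which exact values the fixed-point results in the example above approximate.

```python
def expected_freq_sketch(observed):
    """Plaintext sketch of expected frequencies for a 2-D table (hypothetical helper)."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    # The grand total is the value that the crandas function is documented to open.
    total = sum(row_sums)
    # Under independence, each cell is (row marginal * column marginal) / total.
    return [[r * c / total for c in col_sums] for r in row_sums]


expected = expected_freq_sketch([[10, 10, 20], [20, 20, 20]])
print(expected)
```

For this table the exact expected frequencies are 12, 12, 16 in the first row and 18, 18, 24 in the second, so the deviations in the crandas output are approximation error of the secure computation.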