crandas.stats
This module implements hypothesis testing for crandas. The API is modeled on
the scipy.stats module, which is one of the most commonly used statistics
packages in Python.
Supported tests:
- stats.chisquare(): Pearson’s Chi-Square test
- stats.ttest_ind(): T-test
- stats.kruskal(): Kruskal-Wallis H-test
See: https://docs.scipy.org/doc/scipy/reference/stats.html
- crandas.stats.chisquare(f_obs, f_exp=None, ddof=0, axis=0, **query_args)
Calculate Pearson’s Chi-Square test.
The chi-square test tests the null hypothesis that the categorical data in ‘f_obs’ has the given frequencies of ‘f_exp’.
This function outputs the Pearson test statistic, as well as the associated p-value for a Chi-Squared distribution with k - 1 - ddof degrees of freedom, where k = len(f_obs) denotes the number of distinct categories in the input sample. Both of these values are returned as public numbers.
f_obs is a secret-shared column containing the observations. f_exp can either be a secret-shared column or an array-like structure (e.g. np.array) containing the expected occurrences (which are fixed point numbers) for each category. It can also be left to None, in which case all categories are assumed to be equally likely, as in SciPy. If specified, f_exp must have the same size as f_obs.
This is the Crandas variant of the related SciPy function scipy.stats.chisquare(). See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html
Examples
>>> import crandas as cd
>>> f_obs = cd.DataFrame({"x": [43, 52, 54, 40]}, ctype={"x": "fp32"})['x']
>>> f_exp = [83.16, 45.36, 54.81, 5.67]
>>> cd.stats.chisquare(f_obs, f_exp)
ChiSquareResult(statistic=228.23486995697021, pvalue=3.3300384611141438e-49, df=3)
- Parameters:
f_obs (CSeries) – Observed frequencies in each category.
f_exp (CSeries or array-like) – Expected number of occurrences in each category. If None, the categories are assumed to be equally likely.
ddof (int, optional) – “Delta degrees of freedom”: adjustment to the degrees of freedom for the p-value. The p-value is computed using a chi-squared distribution with k - 1 - ddof degrees of freedom, where k is the number of observed frequencies. The default value of ddof is 0.
axis (int or None, optional) – Currently not supported. Needs to be left at 0.
query_args – See Query Arguments
- Returns:
A namedtuple object containing the attributes:
- statistic (float) – The chi-squared test statistic.
- pvalue (float) – The p-value of the test.
- df (float) – The number of degrees of freedom used for the Chi-Squared distribution.
- Return type:
ChiSquareResult
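To make the quantities returned by chisquare() concrete, the following is a minimal plain-Python sketch of the Pearson statistic and the degrees of freedom k - 1 - ddof described above. It works on ordinary lists and is not how crandas computes the result on secret-shared data. The helper names pearson_chisquare and chi2_sf_3df are hypothetical, and the closed-form p-value is only valid for exactly 3 degrees of freedom.

```python
import math


def pearson_chisquare(f_obs, f_exp=None, ddof=0):
    """Pearson statistic and degrees of freedom on plain Python lists."""
    k = len(f_obs)
    if f_exp is None:
        # Equally likely categories: each expected count is the mean observed count.
        f_exp = [sum(f_obs) / k] * k
    if len(f_exp) != k:
        raise ValueError("f_exp must have the same size as f_obs")
    statistic = sum((o - e) ** 2 / e for o, e in zip(f_obs, f_exp))
    df = k - 1 - ddof
    return statistic, df


def chi2_sf_3df(x):
    """Upper tail of the chi-squared distribution, valid for df == 3 only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Same input as the documented example above.
statistic, df = pearson_chisquare([43, 52, 54, 40], [83.16, 45.36, 54.81, 5.67])
pvalue = chi2_sf_3df(statistic)
print(statistic, df, pvalue)
```

The statistic agrees with the documented output to about three decimal places; the small difference is consistent with crandas working on fixed-point numbers.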
- crandas.stats.kruskal(*samples, nan_policy='raise', axis=0, keepdims=False, **query_args)
Compute the Kruskal-Wallis H-test for independent samples.
The Kruskal-Wallis H-test tests the null hypothesis that the population medians of all groups are equal. The test works on two or more independent samples, which may have different sizes.
This implementation of the Kruskal-Wallis test takes into account tied values present in the samples.
This is the Crandas variant of the related SciPy function scipy.stats.kruskal(). See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html
Examples
>>> x = cd.DataFrame({"x": [7, 3, 3.2, 1, 5]}, ctype={"x": "fp40"})['x']
>>> y = cd.DataFrame({"x": [3, -5.8, 11.3]}, ctype={"x": "fp40"})['x']
>>> z = cd.DataFrame({"x": [-5.2, 3, 2.1, 6.2, 3.7]}, ctype={"x": "fp40"})['x']
>>> cd.stats.kruskal(x, y, z)
KruskalResult(statistic=0.35565662384033203, pvalue=0.8370861288622853, df=2)
- Parameters:
sample1 (CSeries) – Two or more arrays with the sample measurements can be given as arguments. Samples must be one-dimensional.
sample2 (CSeries) – Two or more arrays with the sample measurements can be given as arguments. Samples must be one-dimensional.
… (CSeries) – Two or more arrays with the sample measurements can be given as arguments. Samples must be one-dimensional.
nan_policy (str) – Defines how to handle input NaNs. Currently, only the default ‘raise’ is supported. Note that this means that all samples should be non-nullable columns.
axis (int or None, optional) – Currently not supported. Needs to be left at 0.
keepdims (bool) – Currently not supported. Needs to be left at False.
query_args – See Query Arguments
- Returns:
A namedtuple object containing the attributes:
- statistic (float) – The Kruskal-Wallis H statistic, corrected for ties.
- pvalue (float) – The p-value of the test.
- df (float) – The number of degrees of freedom used for the Chi-Squared distribution.
- Return type:
KruskalResult
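As an illustration of the tie correction mentioned above, here is a minimal plain-Python sketch of the standard tie-corrected Kruskal-Wallis H statistic. It operates on ordinary lists, not on secret-shared columns, and the helper name kruskal_h is hypothetical. The p-value line uses the closed form exp(-H / 2), which holds only when there are exactly 2 degrees of freedom (three samples).

```python
import math


def kruskal_h(*samples):
    """Tie-corrected Kruskal-Wallis H and degrees of freedom on plain lists."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    pooled = sorted(v for s in samples for v in s)
    n = len(pooled)
    rank_of = {}   # value -> average rank; tied values share one rank
    tie_term = 0   # sum of t**3 - t over groups of t tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    rank_sums = sum(sum(rank_of[v] for v in s) ** 2 / len(s) for s in samples)
    h = 12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)
    # The correction is zero if all values are identical; that case is not handled here.
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction, len(samples) - 1


# Same input as the documented example above.
x = [7, 3, 3.2, 1, 5]
y = [3, -5.8, 11.3]
z = [-5.2, 3, 2.1, 6.2, 3.7]
statistic, df = kruskal_h(x, y, z)
pvalue = math.exp(-statistic / 2)  # chi-squared upper tail, valid for df == 2 only
print(statistic, df, pvalue)
```

The value 3 appears once in each sample, so the three tied observations share the average rank 6. The result agrees with the documented output to about three decimal places.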
- crandas.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='raise', permutations=None, random_state=None, alternative='two-sided', trim=0, keepdims=False, **query_args)
Calculate the T-test for the means of two independent samples of scores.
This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default. The test for non-equal variances is also supported (this is Welch’s t-test), which can be done by setting equal_var=False. The samples are also permitted to have distinct sizes.
By default, the p-value is determined by comparing the t-statistic of the observed data against a Student t-distribution.
This function outputs the t-test statistic, as well as the associated p-value for a t-distribution with the appropriate degrees of freedom, which is also included in the output. These values are returned as public numbers.
This is the Crandas variant of the related SciPy function scipy.stats.ttest_ind(). See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
Examples
Test with two samples with identical mean:
>>> import scipy.stats
>>> rvs1 = scipy.stats.norm.rvs(loc=2, scale=4, size=1000)
>>> rvs2 = scipy.stats.norm.rvs(loc=2, scale=4, size=1000)
>>> a = cd.DataFrame({"x": rvs1}, ctype={"x": "fp32"})['x']
>>> b = cd.DataFrame({"x": rvs2}, ctype={"x": "fp32"})['x']
>>> res = cd.stats.ttest_ind(a, b)
>>> res
TtestResult(statistic=0.4355583190917969, pvalue=0.6632514017563175, df=998)
As expected, this results in a high p-value. In addition, it is also possible to determine the confidence interval for mean(a) - mean(b):
>>> res.confidence_interval(0.95)
ConfidenceInterval(low=-0.9779958724975586, high=1.5385313034057617)
Test with two samples with identical mean but unequal variance:
>>> rvs1 = scipy.stats.norm.rvs(loc=1, scale=7, size=1000)
>>> rvs2 = scipy.stats.norm.rvs(loc=1, scale=2, size=1000)
>>> a = cd.DataFrame({"x": rvs1}, ctype={"x": "fp[min=-127,max=127]"})['x']
>>> b = cd.DataFrame({"x": rvs2}, ctype={"x": "fp[min=-127,max=127]"})['x']
>>> cd.stats.ttest_ind(a, b, equal_var=False)
TtestResult(statistic=-0.271575927734375, pvalue=0.7859958130994953, df=1175.0956583023071)
- Note: the t-test for unequal variances requires tighter bounds (min, max)
on the input values, to avoid an internal numeric overflow.
- Parameters:
a (CSeries) – First sample
b (CSeries) – Second sample
axis (int or None, optional) – Currently not supported. Needs to be left at 0.
equal_var (bool) – If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance.
nan_policy (str) – Defines how to handle input NaNs. Currently, only the default ‘raise’ is supported. Note that this means that both ‘a’ and ‘b’ should be non-nullable columns.
permutations (None) – Currently not supported. Needs to be left at None.
random_state (None) – Currently not supported. Needs to be left at None.
alternative ({‘two-sided’, ‘less’, ‘greater’}, optional) – Defines the alternative hypothesis. The following options are available (default is ‘two-sided’):
- ‘two-sided’: the means of the distributions underlying the samples are unequal.
- ‘less’: the mean of the distribution underlying sample ‘a’ is less than the mean of the distribution underlying sample ‘b’.
- ‘greater’: the mean of the distribution underlying sample ‘a’ is greater than the mean of the distribution underlying sample ‘b’.
trim (float) – Currently not supported. Needs to be left at 0.
keepdims (bool) – Currently not supported. Needs to be left at False.
query_args – See Query Arguments
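The three options for alternative are related through the symmetry of the t-distribution. The sketch below shows that relationship in plain Python, assuming a two-sided p-value and the sign of the t-statistic are already known. The helper name one_sided_pvalues is hypothetical and is not part of crandas; in practice the alternative parameter should be passed directly to ttest_ind().

```python
def one_sided_pvalues(statistic, two_sided_pvalue):
    """Derive the 'less' and 'greater' p-values from a two-sided t-test result.

    Relies only on the symmetry of the t-distribution around zero.
    """
    half = two_sided_pvalue / 2
    if statistic < 0:
        # mean(a) < mean(b) was observed, so 'less' gets the small tail.
        return {"less": half, "greater": 1 - half}
    return {"less": 1 - half, "greater": half}


# Statistic and two-sided p-value taken from the first documented example.
p = one_sided_pvalues(0.4355583190917969, 0.6632514017563175)
print(p)
```

The two one-sided p-values always sum to 1, and the smaller of the two is half the two-sided p-value.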
- Returns:
An object containing the attributes and methods:
- statistic (float) – The t-statistic.
- pvalue (float) – The p-value associated with the given alternative.
- df (float) – Degrees of freedom used for the t-distribution.
- confidence_interval(confidence_level=0.95) – Computes a confidence interval around the difference in population means (mean(a) - mean(b)) for the given confidence level. The confidence interval is returned in a namedtuple with fields low and high.
- Return type:
TtestResult
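The difference between the default test and Welch’s t-test (equal_var=False) lies in the standard error and the degrees of freedom. The following plain-Python sketch uses the textbook formulas on ordinary lists; it does not reflect how crandas evaluates the test on secret-shared data. The helper name t_statistic and the sample values are hypothetical.

```python
from statistics import mean, variance


def t_statistic(a, b, equal_var=True):
    """t-statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    diff = mean(a) - mean(b)
    if equal_var:
        # Student's t-test: pool the variances, df = na + nb - 2.
        df = na + nb - 2
        pooled = ((na - 1) * va + (nb - 1) * vb) / df
        se = (pooled * (1 / na + 1 / nb)) ** 0.5
    else:
        # Welch's t-test: separate variances, Welch-Satterthwaite df.
        sa, sb = va / na, vb / nb
        df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
        se = (sa + sb) ** 0.5
    return diff / se, df


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0]
student_t, student_df = t_statistic(a, b)
welch_t, welch_df = t_statistic(a, b, equal_var=False)
print(student_t, student_df)
print(welch_t, welch_df)
```

With two samples of 1000 values each, the pooled test has 1000 + 1000 - 2 = 998 degrees of freedom, matching the first documented example. Welch’s degrees of freedom are generally not an integer, as in the second documented example.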