Go to the first, previous, next, last section, table of contents.

This chapter describes the statistical functions in the library. The basic statistical functions include routines to compute the mean, variance and standard deviation. More advanced functions allow you to calculate absolute deviations, skewness, and kurtosis as well as the median and arbitrary percentiles. The algorithms use recurrence relations to compute average quantities in a stable way, without large intermediate values that might overflow.

The functions are available in versions for datasets in the standard
floating-point and integer types. The versions for double precision
floating-point data have the prefix `gsl_stats`

and are declared in
the header file ``gsl_statistics_double.h'`. The versions for integer
data have the prefix `gsl_stats_int`

and are declared in the header
file ``gsl_statistics_int.h'`. All the functions operate on C
arrays with a `stride` parameter specifying the spacing between
elements.

__Function:__double**gsl_stats_mean***(const double*`data`[], size_t`stride`, size_t`n`)-
This function returns the arithmetic mean of
`data`, a dataset of length`n`with stride`stride`. The arithmetic mean, or*sample mean*, is denoted by \Hat\mu and defined as,\Hat\mu = (1/N) \sum x_i

where x_i are the elements of the dataset

`data`. For samples drawn from a gaussian distribution the variance of \Hat\mu is \sigma^2 / N.

__Function:__double**gsl_stats_variance***(const double*`data`[], size_t`stride`, size_t`n`)-
This function returns the estimated, or
*sample*, variance of`data`, a dataset of length`n`with stride`stride`. The estimated variance is denoted by \Hat\sigma^2 and is defined by,\Hat\sigma^2 = (1/(N-1)) \sum (x_i - \Hat\mu)^2

where x_i are the elements of the dataset

`data`. Note that the normalization factor of 1/(N-1) results from the derivation of \Hat\sigma^2 as an unbiased estimator of the population variance \sigma^2. For samples drawn from a Gaussian distribution the variance of \Hat\sigma^2 itself is 2 \sigma^4 / N.This function computes the mean via a call to

`gsl_stats_mean`

. If you have already computed the mean then you can pass it directly to`gsl_stats_variance_m`

.

__Function:__double**gsl_stats_variance_m***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`)-
This function returns the sample variance of
`data`relative to the given value of`mean`. The function is computed with \Hat\mu replaced by the value of`mean`that you supply,\Hat\sigma^2 = (1/(N-1)) \sum (x_i - mean)^2

__Function:__double**gsl_stats_sd***(const double*`data`[], size_t`stride`, size_t`n`)__Function:__double**gsl_stats_sd_m***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`)- The standard deviation is defined as the square root of the variance. These functions return the square root of the corresponding variance functions above.

__Function:__double**gsl_stats_tss***(const double*`data`[], size_t`stride`, size_t`n`)__Function:__double**gsl_stats_tss_m***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`)-
These functions return the total sum of squares (TSS) of
`data`about the mean. For`gsl_stats_tss_m`

the user-supplied value of`mean`is used, and for`gsl_stats_tss`

it is computed using`gsl_stats_mean`

.TSS = \sum (x_i - mean)^2

__Function:__double**gsl_stats_variance_with_fixed_mean***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`)-
This function computes an unbiased estimate of the variance of
`data`when the population mean`mean`of the underlying distribution is known*a priori*. In this case the estimator for the variance uses the factor 1/N and the sample mean \Hat\mu is replaced by the known population mean \mu,\Hat\sigma^2 = (1/N) \sum (x_i - \mu)^2

__Function:__double**gsl_stats_sd_with_fixed_mean***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`)-
This function calculates the standard deviation of
`data`for a fixed population mean`mean`. The result is the square root of the corresponding variance function.

__Function:__double**gsl_stats_absdev***(const double*`data`[], size_t`stride`, size_t`n`)-
This function computes the absolute deviation from the mean of
`data`, a dataset of length`n`with stride`stride`. The absolute deviation from the mean is defined as,absdev = (1/N) \sum |x_i - \Hat\mu|

where x_i are the elements of the dataset

`data`. The absolute deviation from the mean provides a more robust measure of the width of a distribution than the variance. This function computes the mean of`data`via a call to`gsl_stats_mean`

.

__Function:__double**gsl_stats_absdev_m***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`)-
This function computes the absolute deviation of the dataset
`data`relative to the given value of`mean`,absdev = (1/N) \sum |x_i - mean|

This function is useful if you have already computed the mean of

`data`(and want to avoid recomputing it), or wish to calculate the absolute deviation relative to another value (such as zero, or the median).

__Function:__double**gsl_stats_skew***(const double*`data`[], size_t`stride`, size_t`n`)-
This function computes the skewness of
`data`, a dataset of length`n`with stride`stride`. The skewness is defined as,skew = (1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^3

where x_i are the elements of the dataset

`data`. The skewness measures the asymmetry of the tails of a distribution.The function computes the mean and estimated standard deviation of

`data`via calls to`gsl_stats_mean`

and`gsl_stats_sd`

.

__Function:__double**gsl_stats_skew_m_sd***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`, double`sd`)-
This function computes the skewness of the dataset
`data`using the given values of the mean`mean`and standard deviation`sd`,skew = (1/N) \sum ((x_i - mean)/sd)^3

These functions are useful if you have already computed the mean and standard deviation of

`data`and want to avoid recomputing them.

__Function:__double**gsl_stats_kurtosis***(const double*`data`[], size_t`stride`, size_t`n`)-
This function computes the kurtosis of
`data`, a dataset of length`n`with stride`stride`. The kurtosis is defined as,kurtosis = ((1/N) \sum ((x_i - \Hat\mu)/\Hat\sigma)^4) - 3

The kurtosis measures how sharply peaked a distribution is, relative to its width. The kurtosis is normalized to zero for a Gaussian distribution.

__Function:__double**gsl_stats_kurtosis_m_sd***(const double*`data`[], size_t`stride`, size_t`n`, double`mean`, double`sd`)-
This function computes the kurtosis of the dataset
`data`using the given values of the mean`mean`and standard deviation`sd`,kurtosis = ((1/N) \sum ((x_i - mean)/sd)^4) - 3

This function is useful if you have already computed the mean and standard deviation of

`data`and want to avoid recomputing them.

__Function:__double**gsl_stats_lag1_autocorrelation***(const double*`data`[], const size_t`stride`, const size_t`n`)-
This function computes the lag-1 autocorrelation of the dataset
`data`.a_1 = {\sum_{i = 1}^{n} (x_{i} - \Hat\mu) (x_{i-1} - \Hat\mu) \over \sum_{i = 1}^{n} (x_{i} - \Hat\mu) (x_{i} - \Hat\mu)}

__Function:__double**gsl_stats_lag1_autocorrelation_m***(const double*`data`[], const size_t`stride`, const size_t`n`, const double`mean`)-
This function computes the lag-1 autocorrelation of the dataset
`data`using the given value of the mean`mean`.

__Function:__double**gsl_stats_covariance***(const double*`data1`[], const size_t`stride1`, const double`data2`[], const size_t`stride2`, const size_t`n`)-
This function computes the covariance of the datasets
`data1`and`data2`which must both be of the same length`n`.covar = (1/(n - 1)) \sum_{i = 1}^{n} (x_i - \Hat x) (y_i - \Hat y)

__Function:__double**gsl_stats_covariance_m***(const double*`data1`[], const size_t`stride1`, const double`data2`[], const size_t`stride2`, const size_t`n`, const double`mean1`, const double`mean2`)-
This function computes the covariance of the datasets
`data1`and`data2`using the given values of the means,`mean1`and`mean2`. This is useful if you have already computed the means of`data1`and`data2`and want to avoid recomputing them.

__Function:__double**gsl_stats_correlation***(const double*`data1`[], const size_t`stride1`, const double`data2`[], const size_t`stride2`, const size_t`n`)-
This function efficiently computes the Pearson correlation coefficient
between the datasets
`data1`and`data2`which must both be of the same length`n`.r = cov(x, y) / (\Hat\sigma_x \Hat\sigma_y) = {1/(n-1) \sum (x_i - \Hat x) (y_i - \Hat y) \over \sqrt{1/(n-1) \sum (x_i - \Hat x)^2} \sqrt{1/(n-1) \sum (y_i - \Hat y)^2} }

__Function:__double**gsl_stats_spearman***(const double*`data1`[], const size_t`stride1`, const double`data2`[], const size_t`stride2`, const size_t`n`, double`work`[])-
This function computes the Spearman rank correlation coefficient between
the datasets
`data1`and`data2`which must both be of the same length`n`. Additional workspace of size 2*`n`is required in`work`. The Spearman rank correlation between vectors x and y is equivalent to the Pearson correlation between the ranked vectors x_R and y_R, where ranks are defined to be the average of the positions of an element in the ascending order of the values.

The functions described in this section allow the computation of statistics for weighted samples. The functions accept an array of samples, x_i, with associated weights, w_i. Each sample x_i is considered as having been drawn from a Gaussian distribution with variance \sigma_i^2. The sample weight w_i is defined as the reciprocal of this variance, w_i = 1/\sigma_i^2. Setting a weight to zero corresponds to removing a sample from a dataset.

__Function:__double**gsl_stats_wmean***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`)-
This function returns the weighted mean of the dataset
`data`with stride`stride`and length`n`, using the set of weights`w`with stride`wstride`and length`n`. The weighted mean is defined as,\Hat\mu = (\sum w_i x_i) / (\sum w_i)

__Function:__double**gsl_stats_wvariance***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`)-
This function returns the estimated variance of the dataset
`data`with stride`stride`and length`n`, using the set of weights`w`with stride`wstride`and length`n`. The estimated variance of a weighted dataset is calculated as,\Hat\sigma^2 = ((\sum w_i)/((\sum w_i)^2 - \sum (w_i^2))) \sum w_i (x_i - \Hat\mu)^2

Note that this expression reduces to an unweighted variance with the familiar 1/(N-1) factor when there are N equal non-zero weights.

__Function:__double**gsl_stats_wvariance_m***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, double`wmean`)-
This function returns the estimated variance of the weighted dataset
`data`using the given weighted mean`wmean`.

__Function:__double**gsl_stats_wsd***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`)-
The standard deviation is defined as the square root of the variance.
This function returns the square root of the corresponding variance
function
`gsl_stats_wvariance`

above.

__Function:__double**gsl_stats_wsd_m***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, double`wmean`)-
This function returns the square root of the corresponding variance
function
`gsl_stats_wvariance_m`

above.

__Function:__double**gsl_stats_wvariance_with_fixed_mean***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, const double`mean`)-
This function computes an unbiased estimate of the variance of the weighted
dataset
`data`when the population mean`mean`of the underlying distribution is known*a priori*. In this case the estimator for the variance replaces the sample mean \Hat\mu by the known population mean \mu,\Hat\sigma^2 = (\sum w_i (x_i - \mu)^2) / (\sum w_i)

__Function:__double**gsl_stats_wsd_with_fixed_mean***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, const double`mean`)- The standard deviation is defined as the square root of the variance. This function returns the square root of the corresponding variance function above.

__Function:__double**gsl_stats_wtss***(const double*`w`[], const size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`)__Function:__double**gsl_stats_wtss_m***(const double*`w`[], const size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, double`wmean`)-
These functions return the weighted total sum of squares (TSS) of
`data`about the weighted mean. For`gsl_stats_wtss_m`

the user-supplied value of`wmean`is used, and for`gsl_stats_wtss`

it is computed using`gsl_stats_wmean`

.TSS = \sum w_i (x_i - wmean)^2

__Function:__double**gsl_stats_wabsdev***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`)-
This function computes the weighted absolute deviation from the weighted
mean of
`data`. The absolute deviation from the mean is defined as,absdev = (\sum w_i |x_i - \Hat\mu|) / (\sum w_i)

__Function:__double**gsl_stats_wabsdev_m***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, double`wmean`)-
This function computes the absolute deviation of the weighted dataset
`data`about the given weighted mean`wmean`.

__Function:__double**gsl_stats_wskew***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`)-
This function computes the weighted skewness of the dataset
`data`.skew = (\sum w_i ((x_i - \Hat x)/\Hat \sigma)^3) / (\sum w_i)

__Function:__double**gsl_stats_wskew_m_sd***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, double`wmean`, double`wsd`)-
This function computes the weighted skewness of the dataset
`data`using the given values of the weighted mean and weighted standard deviation,`wmean`and`wsd`.

__Function:__double**gsl_stats_wkurtosis***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`)-
This function computes the weighted kurtosis of the dataset
`data`.kurtosis = ((\sum w_i ((x_i - \Hat x)/\Hat \sigma)^4) / (\sum w_i)) - 3

__Function:__double**gsl_stats_wkurtosis_m_sd***(const double*`w`[], size_t`wstride`, const double`data`[], size_t`stride`, size_t`n`, double`wmean`, double`wsd`)-
This function computes the weighted kurtosis of the dataset
`data`using the given values of the weighted mean and weighted standard deviation,`wmean`and`wsd`.

The following functions find the maximum and minimum values of a
dataset (or their indices). If the data contains `NaN`

s then a
`NaN`

will be returned, since the maximum or minimum value is
undefined. For functions which return an index, the location of the
first `NaN`

in the array is returned.

__Function:__double**gsl_stats_max***(const double*`data`[], size_t`stride`, size_t`n`)-
This function returns the maximum value in
`data`, a dataset of length`n`with stride`stride`. The maximum value is defined as the value of the element x_i which satisfies x_i >= x_j for all j.If you want instead to find the element with the largest absolute magnitude you will need to apply

`fabs`

or`abs`

to your data before calling this function.

__Function:__double**gsl_stats_min***(const double*`data`[], size_t`stride`, size_t`n`)-
This function returns the minimum value in
`data`, a dataset of length`n`with stride`stride`. The minimum value is defined as the value of the element x_i which satisfies x_i <= x_j for all j.If you want instead to find the element with the smallest absolute magnitude you will need to apply

`fabs`

or`abs`

to your data before calling this function.

__Function:__void**gsl_stats_minmax***(double **`min`, double *`max`, const double`data`[], size_t`stride`, size_t`n`)-
This function finds both the minimum and maximum values
`min`,`max`in`data`in a single pass.

__Function:__size_t**gsl_stats_max_index***(const double*`data`[], size_t`stride`, size_t`n`)-
This function returns the index of the maximum value in
`data`, a dataset of length`n`with stride`stride`. The maximum value is defined as the value of the element x_i which satisfies x_i >= x_j for all j. When there are several equal maximum elements then the first one is chosen.

__Function:__size_t**gsl_stats_min_index***(const double*`data`[], size_t`stride`, size_t`n`)-
This function returns the index of the minimum value in
`data`, a dataset of length`n`with stride`stride`. The minimum value is defined as the value of the element x_i which satisfies x_i >= x_j for all j. When there are several equal minimum elements then the first one is chosen.

__Function:__void**gsl_stats_minmax_index***(size_t **`min_index`, size_t *`max_index`, const double`data`[], size_t`stride`, size_t`n`)-
This function returns the indexes
`min_index`,`max_index`of the minimum and maximum values in`data`in a single pass.

The median and percentile functions described in this section operate on
sorted data. For convenience we use *quantiles*, measured on a scale
of 0 to 1, instead of percentiles (which use a scale of 0 to 100).

__Function:__double**gsl_stats_median_from_sorted_data***(const double*`sorted_data`[], size_t`stride`, size_t`n`)-
This function returns the median value of
`sorted_data`, a dataset of length`n`with stride`stride`. The elements of the array must be in ascending numerical order. There are no checks to see whether the data are sorted, so the function`gsl_sort`

should always be used first.When the dataset has an odd number of elements the median is the value of element (n-1)/2. When the dataset has an even number of elements the median is the mean of the two nearest middle values, elements (n-1)/2 and n/2. Since the algorithm for computing the median involves interpolation this function always returns a floating-point number, even for integer data types.

__Function:__double**gsl_stats_quantile_from_sorted_data***(const double*`sorted_data`[], size_t`stride`, size_t`n`, double`f`)-
This function returns a quantile value of
`sorted_data`, a double-precision array of length`n`with stride`stride`. The elements of the array must be in ascending numerical order. The quantile is determined by the`f`, a fraction between 0 and 1. For example, to compute the value of the 75th percentile`f`should have the value 0.75.There are no checks to see whether the data are sorted, so the function

`gsl_sort`

should always be used first.The quantile is found by interpolation, using the formula

quantile = (1 - \delta) x_i + \delta x_{i+1}

where i is

`floor`

((n - 1)f) and \delta is (n-1)f - i.Thus the minimum value of the array (

`data[0*stride]`

) is given by`f`equal to zero, the maximum value (`data[(n-1)*stride]`

) is given by`f`equal to one and the median value is given by`f`equal to 0.5. Since the algorithm for computing quantiles involves interpolation this function always returns a floating-point number, even for integer data types.

Here is a basic example of how to use the statistical functions:

#include <stdio.h> #include <gsl/gsl_statistics.h> int main(void) { double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6}; double mean, variance, largest, smallest; mean = gsl_stats_mean(data, 1, 5); variance = gsl_stats_variance(data, 1, 5); largest = gsl_stats_max(data, 1, 5); smallest = gsl_stats_min(data, 1, 5); printf ("The dataset is %g, %g, %g, %g, %g\n", data[0], data[1], data[2], data[3], data[4]); printf ("The sample mean is %g\n", mean); printf ("The estimated variance is %g\n", variance); printf ("The largest value is %g\n", largest); printf ("The smallest value is %g\n", smallest); return 0; }

The program should produce the following output,

The dataset is 17.2, 18.1, 16.5, 18.3, 12.6 The sample mean is 16.54 The estimated variance is 5.373 The largest value is 18.3 The smallest value is 12.6

Here is an example using sorted data,

#include <stdio.h> #include <gsl/gsl_sort.h> #include <gsl/gsl_statistics.h> int main(void) { double data[5] = {17.2, 18.1, 16.5, 18.3, 12.6}; double median, upperq, lowerq; printf ("Original dataset: %g, %g, %g, %g, %g\n", data[0], data[1], data[2], data[3], data[4]); gsl_sort (data, 1, 5); printf ("Sorted dataset: %g, %g, %g, %g, %g\n", data[0], data[1], data[2], data[3], data[4]); median = gsl_stats_median_from_sorted_data (data, 1, 5); upperq = gsl_stats_quantile_from_sorted_data (data, 1, 5, 0.75); lowerq = gsl_stats_quantile_from_sorted_data (data, 1, 5, 0.25); printf ("The median is %g\n", median); printf ("The upper quartile is %g\n", upperq); printf ("The lower quartile is %g\n", lowerq); return 0; }

This program should produce the following output,

Original dataset: 17.2, 18.1, 16.5, 18.3, 12.6 Sorted dataset: 12.6, 16.5, 17.2, 18.1, 18.3 The median is 17.2 The upper quartile is 18.1 The lower quartile is 16.5

The standard reference for almost any topic in statistics is the multi-volume Advanced Theory of Statistics by Kendall and Stuart.

- Maurice Kendall, Alan Stuart, and J. Keith Ord. The Advanced Theory of Statistics (multiple volumes) reprinted as Kendall's Advanced Theory of Statistics. Wiley, ISBN 047023380X.

Many statistical concepts can be more easily understood by a Bayesian approach. The following book by Gelman, Carlin, Stern and Rubin gives a comprehensive coverage of the subject.

- Andrew Gelman, John B. Carlin, Hal S. Stern, Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall, ISBN 0412039915.

For physicists the Particle Data Group provides useful reviews of Probability and Statistics in the "Mathematical Tools" section of its Annual Review of Particle Physics.

- Review of Particle Properties R.M. Barnett et al., Physical Review D54, 1 (1996)

The Review of Particle Physics is available online at
the website `http://pdg.lbl.gov/`.

Go to the first, previous, next, last section, table of contents.