GNU Astronomy Utilities


Next: , Previous: , Up: Gnuastro library   [Contents][Index]


10.3.21 Statistical operations (statistics.h)

After reading a dataset into memory from a file or fully simulating it with another process, the most common processes that will be done on it are statistical operations to let you quantify different aspects of the data. the functions in this section describe Gnuastro’s current set of tools for this job. All these functions can work on any numeric data type natively (see Numeric data types) and can also work on tiles over a dataset. Hence the inputs and outputs are in Gnuastro’s Generic data container (gal_data_t).

Macro: GAL_STATISTICS_SIG_CLIP_MAX_CONVERGE

The maximum number of clips, when \(\sigma\)-clipping should be done by convergence. If the clipping does not converge before making this many clips, all sigma-clipping outputs will be NaN.

Macro: GAL_STATISTICS_MODE_GOOD_SYM

The minimum acceptable symmetricity of the mode calculation. If the symmetricity of the derived mode is less than this value, all the returned values by gal_statistics_mode will have a value of NaN.

Macro: GAL_STATISTICS_BINS_INVALID
Macro: GAL_STATISTICS_BINS_REGULAR
Macro: GAL_STATISTICS_BINS_IRREGULAR

Macros used to identify if the regularity of the bins when defining bins.

Function:
gal_data_t *
gal_statistics_number (gal_data_t *input)

Return a single-element dataset with type size_t which contains the number of non-blank elements in input.

Function:
gal_data_t *
gal_statistics_minimum (gal_data_t *input)

Return a single-element dataset containing the minimum non-blank value in input. The numerical datatype of the output is the same as input.

Function:
gal_data_t *
gal_statistics_maximum (gal_data_t *input)

Return a single-element dataset containing the maximum non-blank value in input. The numerical datatype of the output is the same as input.

Function:
gal_data_t *
gal_statistics_sum (gal_data_t *input)

Return a single-element (double or float64) dataset containing the sum of the non-blank values in input.

Function:
gal_data_t *
gal_statistics_mean (gal_data_t *input)

Return a single-element (double or float64) dataset containing the mean of the non-blank values in input.

Function:
gal_data_t *
gal_statistics_std (gal_data_t *input)

Return a single-element (double or float64) dataset containing the standard deviation of the non-blank values in input.

Function:
gal_data_t *
gal_statistics_mean_std (gal_data_t *input)

Return a two-element (double or float64) dataset containing the mean and standard deviation of the non-blank values in input. The first element of the returned dataset is the mean and the second is the standard deviation.

This function will calculate both values in one pass over the dataset. Hence when both the mean and standard deviation of a dataset are necessary, this function is much more efficient than calling gal_statistics_mean and gal_statistics_std separately.

Function:
gal_data_t *
gal_statistics_median (gal_data_t *input, int inplace)

Return a single-element dataset containing the median of the non-blank values in input. The numerical datatype of the output is the same as input.

Calculating the median involves sorting the dataset and removing blank values, for better performance (and less memory usage), you can give a non-zero value to the inplace argument. In this case, the sorting and removal of blank elements will be done directly on the input dataset. However, after this function the original dataset may have changed (if it wasn’t sorted or had blank values).

Function:
size_t
gal_statistics_quantile_index (size_t size, double quantile)

Return the index of the element that has a quantile of quantile assuming the dataset has size elements.

Function:
size_t
gal_statistics_quantile (gal_data_t *input, double quantile, int inplace)

Return a single-element dataset containing the value with in a quantile quantile of the non-blank values in input. The numerical datatype of the output is the same as input. See gal_statistics_median for a description of inplace.

Function:
size_t
gal_statistics_quantile_function_index (gal_data_t *input, gal_data_t *value, int inplace)

Return the index of the quantile function (inverse quantile) of input at value. In other words, this function will return the index of the nearest element (of a sorted and non-blank) input to value. If the value is outside the range of the input, then this function will return GAL_BLANK_SIZE_T.

Function:
gal_data_t *
gal_statistics_quantile_function (gal_data_t *input, gal_data_t *value, int inplace)

Return a single-element (double or float64) dataset containing the quantile function of the non-blank values in input at value. In other words, this function will return the quantile of value in input. If the value is smaller than the input’s smallest element, the returned value will be zero. If the value is larger than the input’s largest element, then the returned value is 1. See gal_statistics_median for a description of inplace.

Function:
gal_data_t *
gal_statistics_mode (gal_data_t *input, float mirrordist, int inplace)

Return a four-element (double or float64) dataset that contains the mode of the input distribution. This function implements the non-parametric algorithm to find the mode that is described in Appendix C of Akhlaghi and Ichikawa [2015].

In short it compares the actual distribution and its “mirror distribution” to find the mode. In order to be efficient, you can determine how far the comparison goes away from the mirror through the mirrordist parameter (think of it as a multiple of sigma/error). See gal_statistics_median for a description of inplace.

The output array has the following elements (in the given order, note that counting in C starts from 0).

array[0]: mode
array[1]: mode quantile.
array[2]: symmetricity.
array[3]: value at the end of symmetricity.
Function:
gal_data_t *
gal_statistics_mode_mirror_plots (gal_data_t *input, gal_data_t *value, size_t numbins, int inplace, double *mirror_val)

Make a mirrored histogram and cumulative frequency plot (with numbins) with the mirror distribution of the input having a value in value. If all the input elements are blank, or the mirror value is outside the range of the input, this function will return a NULL pointer.

The output is a list of data structures (see List of gal_data_t): the first is the bins with one bin at the mirror point, the second is the histogram with a maximum of one and the third is the cumulative frequency plot (with a maximum of one).

Function:
int
gal_statistics_is_sorted (gal_data_t *input, int updateflags)

Return 0 if the input is not sorted, if it is sorted, this function will return 1 and 2 if it is increasing or decreasing, respectively. This function will abort with an error if input has zero elements. This function will only look into the dataset if the GAL_DATA_FLAG_SORT_CH bit of input->flag is 0, see Generic data container (gal_data_t).

When the flags don’t indicate a previous check and updateflags is non-zero, this function will set the flags appropriately to avoid having to re-check the dataset in future calls (this can be very useful when repeated checks are necessary). When updateflags==0, this function has no side-effects on the dataset: it will not toggle the flags.

If you want to re-check a dataset with the blank-value-check flag already set (for example if you have made changes to it), then explicitly set the GAL_DATA_FLAG_SORT_CH bit to zero before calling this function. When there are no other flags, you can simply set the flags to zero (with input->flags=0), otherwise you can use this expression:

input->flags &= ~GAL_DATA_FLAG_SORT_CH;
Function:
void
gal_statistics_sort_increasing (gal_data_t *input)

Sort the input dataset (in place) in an increasing order and toggle the sort-related bit flags accordingly.

Function:
void
gal_statistics_sort_decreasing (gal_data_t *input)

Sort the input dataset (in place) in a decreasing order and toggle the sort-related bit flags accordingly.

Function:
gal_data_t *
gal_statistics_no_blank_sorted (gal_data_t *input, int inplace)

Remove all the blanks and sort the input dataset. If inplace is non-zero this will happen on the input dataset (and the output dataset will be the input dataset). However, if inplace is zero, this function will allocate a new copy of the dataset that is sorted and has no blank values.

This function uses the bit flags of the input, so if you have modified the dataset, set input->flags=0 before calling this function. Also note that inplace is only for the dataset elements. Therefore even when inplace==0, if the input is already sorted and has no blank values, then the flags will be updated to show this.

If all the elements were blank, then the returned dataset’s size will be zero. This is thus a good parameter to check after calling this function to see if there actually were any non-blank elements in the input or not and take the appropriate measure. This can help avoid strange bugs in later steps.

Function:
gal_data_t *
gal_statistics_regular_bins (gal_data_t *input, gal_data_t *inrange, size_t numbins, double onebinstart)

Generate an array of regularly spaced elements as a 1D array (column) of type double (i.e., float64, it has to be double to account for small differences on the bin edges). The input arguments are described below

input

The dataset you want to apply the bins to. This is only necessary if the range argument is not complete, see below. If inrange has all the necessary information, you can pass a NULL pointer for this.

inrange

This dataset keeps the desired range along each dimension of the input data structure, it has to be in float (i.e., float32) type.

  • If you want the full range of the dataset (in any dimensions, then just set inrange to NULL and the range will be specified from the minimum and maximum value of the dataset (input cannot be NULL in this case).
  • If there is one element for each dimension in range, then it is viewed as a quantile (Q), and the range will be: ‘Q to 1-Q’.
  • If there are two elements for each dimension in range, then they are assumed to be your desired minimum and maximum values. When either of the two are NaN, the minimum and maximum will be calculated for it.
numbins

The number of bins: must be larger than 0.

onebinstart

A desired value for onebinstart. Note that with this option, the bins won’t start and end exactly on the given range values, it will be slightly shifted to accommodate this request.

Function:
gal_data_t *
gal_statistics_histogram (gal_data_t *input, gal_data_t *bins, int normalize, int maxone)

Make a histogram of all the elements in the given dataset with bin values that are defined in the inbins structure (see gal_statistics_regular_bins). inbins is not mandatory, if you pass a NULL pointer, the bins structure will be built within this function based on the numbins input. As a result, when you have already defined the bins, numbins is not used.

Function:
gal_data_t *
gal_statistics_cfp (gal_data_t *input, gal_data_t *bins, int normalize)

Make a cumulative frequency plot (CFP) of all the elements in input with bin values that are defined in the bins structure (see gal_statistics_regular_bins).

The CFP is built from the histogram: in each bin, the value is the sum of all previous bins in the histogram. Thus, if you have already calculated the histogram before calling this function, you can pass it onto this function as the data structure in bins->next (see List of gal_data_t). If bin->next!=NULL, then it is assumed to be the histogram. If it is NULL, then the histogram will be calculated internally and freed after the job is finished.

When a histogram is given and it is normalized, the CFP will also be normalized (even if the normalized flag is not set here): note that a normalized CFP’s maximum value is 1.

Function:
gal_data_t *
gal_statistics_sigma_clip (gal_data_t *input, float multip, float param, int inplace, int quiet)

Apply \(\sigma\)-clipping on a given dataset and return a dataset that contains the results. For a description of \(\sigma\)-clipping see Sigma clipping. multip is the multiple of the standard deviation (or \(\sigma\), that is used to define outliers in each round of clipping).

The role of param is determined based on its value. If param is larger than 1 (one), it must be an integer and will be interpreted as the number clips to do. If it is less than 1 (one), it is interpreted as the tolerance level to stop the iteration.

The returned dataset (let’s call it out) contains a four-element array with type GAL_TYPE_FLOAT32. The final number of clips is stored in the out->status.

float *array=out->array;
array[0]: Number of points used.
array[1]: Median.
array[2]: Mean.
array[3]: Standard deviation.

If the \(\sigma\)-clipping doesn’t converge or all input elements are blank, then this function will return NaN values for all the elements above.


Next: , Previous: , Up: Gnuastro library   [Contents][Index]