After reading a dataset into memory from a file or fully simulating it with another process, the most common processes that will be done on it are statistical operations to let you quantify different aspects of the data.
the functions in this section describe Gnuastro’s current set of tools for this job.
All these functions can work on any numeric data type natively (see Numeric data types) and can also work on tiles over a dataset.
Hence the inputs and outputs are in Gnuastro’s Generic data container (`gal_data_t`

).

- Macro:
**GAL_STATISTICS_SIG_CLIP_MAX_CONVERGE**¶ The maximum number of clips, when \(\sigma\)-clipping should be done by convergence. If the clipping does not converge before making this many clips, all \(\sigma\)-clipping outputs will be NaN.

- Macro:
**GAL_STATISTICS_MODE_GOOD_SYM**¶ The minimum acceptable symmetricity of the mode calculation. If the symmetricity of the derived mode is less than this value, all the returned values by

`gal_statistics_mode`

will have a value of NaN.

- Macro:
**GAL_STATISTICS_BINS_INVALID**¶ - Macro:
**GAL_STATISTICS_BINS_REGULAR**¶ - Macro:
**GAL_STATISTICS_BINS_IRREGULAR**¶ Macros used to identify if the regularity of the bins when defining bins.

- Macro:
**GAL_STATISTICS_CLIP_OUTCOL_STD**¶ - Macro:
**GAL_STATISTICS_CLIP_OUTCOL_MAD**¶ - Macro:
**GAL_STATISTICS_CLIP_OUTCOL_MEAN**¶ - Macro:
**GAL_STATISTICS_CLIP_OUTCOL_MEDIAN**¶ - Macro:
**GAL_STATISTICS_CLIP_OUTCOL_NUMBER_USED**¶ - Macro:
**GAL_STATISTICS_CLIP_OUTCOL_NUMBER_CLIPS**¶ Macros containing the index of the clipping outputs, see the descriptions of

`gal_statistics_clip_sigma`

below.

- Macro:
**GAL_STATISTICS_CLIP_OUTCOL_OPTIONAL_STD**¶ - Macro:
**GAL_STATISTICS_CLIP_OUTCOL_OPTIONAL_MAD**¶ - Macro:
**GAL_STATISTICS_CLIP_OUTCOL_OPTIONAL_MEAN**¶ Macros containing bit flags for optional clipping outputs, see the descriptions of

`gal_statistics_clip_sigma`

below.

- Function:

`gal_data_t *`

**gal_statistics_number**`(gal_data_t`

¶`*input`

) Return a single-element dataset with type

`size_t`

which contains the number of non-blank elements in`input`

.

- Function:

`gal_data_t *`

**gal_statistics_minimum**`(gal_data_t`

¶`*input`

) Return a single-element dataset containing the minimum non-blank value in

`input`

. The numerical datatype of the output is the same as`input`

.

- Function:

`gal_data_t *`

**gal_statistics_maximum**`(gal_data_t`

¶`*input`

) Return a single-element dataset containing the maximum non-blank value in

`input`

. The numerical datatype of the output is the same as`input`

.

- Function:

`gal_data_t *`

**gal_statistics_sum**`(gal_data_t`

¶`*input`

) Return a single-element (

`double`

or`float64`

) dataset containing the sum of the non-blank values in`input`

.

- Function:

`gal_data_t *`

**gal_statistics_mean**`(gal_data_t`

¶`*input`

) Return a single-element (

`double`

or`float64`

) dataset containing the mean of the non-blank values in`input`

.

- Function:

`gal_data_t *`

**gal_statistics_std**`(gal_data_t`

¶`*input`

) Return a single-element (

`double`

or`float64`

) dataset containing the standard deviation of the non-blank values in`input`

.

- Function:

`gal_data_t *`

**gal_statistics_mean_std**`(gal_data_t`

¶`*input`

) Return a two-element (

`double`

or`float64`

) dataset containing the mean and standard deviation of the non-blank values in`input`

. The first element of the returned dataset is the mean and the second is the standard deviation.This function will calculate both values in one pass over the dataset. Hence when both the mean and standard deviation of a dataset are necessary, this function is much more efficient than calling

`gal_statistics_mean`

and`gal_statistics_std`

separately.

- Function:

`double`

**gal_statistics_std_from_sums**`(double`

¶`sum`

, double`sump2`

, size_t`num`

) Return the standard deviation from the values that can be obtained in a single pass through the distribution:

`sum`

: the sum of the elements,`sump2`

: the sum of the power-of-2 of each element, and`num`

: the number of elements.This is a low-level function that is only useful after the distribution of values has been parsed (and the three input arguments are calculated). It is the lower-level function that is used in functions like

`gal_statistics_std`

, or other components of Gnuastro that measure the standard deviation (for example, MakeCatalog’s`--std`column).

- Function:

`gal_data_t *`

**gal_statistics_median**`(gal_data_t`

¶`*input`

, int`inplace`

) Return a single-element dataset containing the median of the non-blank values in

`input`

. The numerical datatype of the output is the same as`input`

.Calculating the median involves sorting the dataset and removing blank values, for better performance (and less memory usage), you can give a non-zero value to the

`inplace`

argument. In this case, the sorting and removal of blank elements will be done directly on the input dataset. However, after this function the original dataset may have changed (if it was not sorted or had blank values).

- Function:

`gal_data_t *`

**gal_statistics_mad**`(gal_data_t`

¶`*input`

, int`inplace`

) Return a single-element dataset with same type as input, containing the median absolute deviation (MAD) of the non-blank values in

`input`

.If

`inplace==0`

, the input dataset will remain untouched. Otherwise, the MAD calculation will be done on the input dataset without allocating a new one (its values will be changed after this function). This is good when you do not need the input after this function and avoid taking extra RAM and CPU.

- Function:

`gal_data_t *`

**gal_statistics_median_mad**`(gal_data_t`

¶`*input`

, int`inplace`

) Return a two-element dataset with same type as input, containing the median and median absolute deviation (MAD) of the non-blank values in

`input`

.If

`inplace==0`

, the input dataset will remain untouched. Otherwise, the MAD calculation will be done on the input dataset without allocating a new one (its values will be changed after this function). This is good when you do not need the input after this function and avoid taking extra RAM and CPU.

- Function:

`size_t`

**gal_statistics_quantile_index**`(size_t`

¶`size`

, double`quantile`

) Return the index of the element that has a quantile of

`quantile`

assuming the dataset has`size`

elements.

- Function:

`gal_data_t *`

**gal_statistics_quantile**`(gal_data_t`

¶`*input`

, double`quantile`

, int`inplace`

) Return a single-element dataset containing the value with in a quantile

`quantile`

of the non-blank values in`input`

. The numerical datatype of the output is the same as`input`

. See`gal_statistics_median`

for a description of`inplace`

.

- Function:

`size_t`

**gal_statistics_quantile_function_index**`(gal_data_t`

¶`*input`

, gal_data_t`*value`

, int`inplace`

) Return the index of the quantile function (inverse quantile) of

`input`

at`value`

. In other words, this function will return the index of the nearest element (of a sorted and non-blank)`input`

to`value`

. If the value is outside the range of the input, then this function will return`GAL_BLANK_SIZE_T`

.

- Function:

`gal_data_t *`

**gal_statistics_quantile_function**`(gal_data_t`

¶`*input`

, gal_data_t`*value`

, int`inplace`

) -
Return a single-element dataset containing the quantile function of the non-blank values in

`input`

at`value`

(a single-element dataset). The numerical data type is of the returned dataset is`float64`

(or`double`

). In other words, this function will return the quantile of`value`

in`input`

.`value`

has to have the same type as`input`

. See`gal_statistics_median`

for a description of`inplace`

.When all elements are blank, the returned value will be NaN. If the value is smaller than the input’s smallest element, the returned value will be negative infinity. If the value is larger than the input’s largest element, then the returned value will be positive infinity

- Function:

`gal_data_t *`

**gal_statistics_unique**`(gal_data_t`

¶`*input`

, int`inplace`

) Return a 1D dataset with the same numeric data type as the input, but only containing its unique elements and without any (possible) blank/NaN elements. Note that the input’s number of dimensions is irrelevant for this function. If

`inplace`

is not zero, then the unique values will over-write the allocated space of the input, otherwise a new space will be allocated and the input will not be touched.

- Function:

`int`

**gal_statistics_has_negative**`(gal_data_t`

¶`*input`

) Return

`1`

if the input dataset contains a negative number and`0`

otherwise. If the dataset doesn’t have a numeric type (as in a string), this function will abort with, saying that it does not recognize the file type.

- Function:

`gal_data_t *`

**gal_statistics_mode**`(gal_data_t`

¶`*input`

, float`mirrordist`

, int`inplace`

) Return a four-element (

`double`

or`float64`

) dataset that contains the mode of the`input`

distribution. This function implements the non-parametric algorithm to find the mode that is described in Appendix C of Akhlaghi and Ichikawa 2015.In short it compares the actual distribution and its “mirror distribution” to find the mode. In order to be efficient, you can determine how far the comparison goes away from the mirror through the

`mirrordist`

parameter (think of it as a multiple of sigma/error). See`gal_statistics_median`

for a description of`inplace`

.The output array has the following elements (in the given order, note that counting in C starts from 0).

array[0]: mode array[1]: mode quantile. array[2]: symmetricity. array[3]: value at the end of symmetricity.

- Function:

`gal_data_t *`

**gal_statistics_mode_mirror_plots**`(gal_data_t`

¶`*input`

, gal_data_t`*value`

, size_t`numbins`

, int`inplace`

, double`*mirror_val`

) Make a mirrored histogram and cumulative frequency plot (with

`numbins`

) with the mirror distribution of the`input`

having a value in`value`

. If all the input elements are blank, or the mirror value is outside the range of the input, this function will return a`NULL`

pointer.The output is a list of data structures (see List of

`gal_data_t`

): the first is the bins with one bin at the mirror point, the second is the histogram with a maximum of one and the third is the cumulative frequency plot (with a maximum of one).

- Function:

`int`

**gal_statistics_is_sorted**`(gal_data_t`

¶`*input`

, int`updateflags`

) Return

`0`

if the input is not sorted, if it is sorted, this function will return`1`

and`2`

if it is increasing or decreasing, respectively. This function will abort with an error if`input`

has zero elements and will return`1`

(sorted, increasing) when there is only one element. This function will only look into the dataset if the`GAL_DATA_FLAG_SORT_CH`

bit of`input->flag`

is`0`

, see Generic data container (`gal_data_t`

).When the flags do not indicate a previous check

*and*`updateflags`

is non-zero, this function will set the flags appropriately to avoid having to re-check the dataset in future calls (this can be very useful when repeated checks are necessary). When`updateflags==0`

, this function has no side-effects on the dataset: it will not toggle the flags.If you want to re-check a dataset with the blank-value-check flag already set (for example, if you have made changes to it), then explicitly set the

`GAL_DATA_FLAG_SORT_CH`

bit to zero before calling this function. When there are no other flags, you can simply set the flags to zero (with`input->flag=0`

), otherwise you can use this expression:input->flag &= ~GAL_DATA_FLAG_SORT_CH;

- Function:

`void`

**gal_statistics_sort_increasing**`(gal_data_t`

¶`*input`

) Sort the input dataset (in place) in an increasing order and toggle the sort-related bit flags accordingly.

- Function:

`void`

**gal_statistics_sort_decreasing**`(gal_data_t`

¶`*input`

) Sort the input dataset (in place) in a decreasing order and toggle the sort-related bit flags accordingly.

- Function:

`gal_data_t *`

**gal_statistics_no_blank_sorted**`(gal_data_t`

¶`*input`

, int`inplace`

) Remove all the blanks and sort the input dataset. If

`inplace`

is non-zero this will happen on the input dataset (in the allocated space of the input dataset). However, if`inplace`

is zero, this function will allocate a new copy of the dataset and work on that. Therefore if`inplace==0`

, the input dataset will be modified.This function uses the bit flags of the input, so if you have modified the dataset, set

`input->flag=0`

before calling this function. Also note that`inplace`

is only for the dataset elements. Therefore even when`inplace==0`

, if the input is already sorted*and*has no blank values, then the flags will be updated to show this.If all the elements were blank, then the returned dataset’s

`size`

will be zero. This is thus a good parameter to check after calling this function to see if there actually were any non-blank elements in the input or not and take the appropriate measure. This can help avoid strange bugs in later steps. The flags of a zero-sized returned dataset will indicate that it has no blanks and is sorted in an increasing order. Even if having blank values or being sorted is not defined on a zero-element dataset, it is up to the caller to choose what they will do with a zero-element dataset. The flags have to be set after this function any way.

- Function:

`gal_data_t *`

**gal_statistics_regular_bins**`(gal_data_t`

¶`*input`

, gal_data_t`*inrange`

, size_t`numbins`

, double`onebinstart`

) Generate an array of regularly spaced elements as a 1D array (column) of type

`double`

(i.e.,`float64`

, it has to be double to account for small differences on the bin edges). The input arguments are described below`input`

The dataset you want to apply the bins to. This is only necessary if the range argument is not complete, see below. If

`inrange`

has all the necessary information, you can pass a`NULL`

pointer for this.`inrange`

This dataset keeps the desired range along each dimension of the input data structure, it has to be in

`float`

(i.e.,`float32`

) type.- If you want the full range of the dataset (in any dimensions, then just set
`inrange`

to`NULL`

and the range will be specified from the minimum and maximum value of the dataset (`input`

cannot be`NULL`

in this case). - If there is one element for each dimension in range, then it is viewed as a quantile (Q), and the range will be: ‘Q to 1-Q’.
- If there are two elements for each dimension in range, then they are assumed to be your desired minimum and maximum values. When either of the two are NaN, the minimum and maximum will be calculated for it.

- If you want the full range of the dataset (in any dimensions, then just set
`numbins`

The number of bins: must be larger than 0.

`onebinstart`

A desired value to start one bin. Note that with this option, the bins will not start and end exactly on the given range values, it will be slightly shifted to accommodate this request (enough for the bin containing the value to start at it). If you do not have any preference on where to start a bin, set this to NAN.

- Function:

`gal_data_t *`

**gal_statistics_histogram**`(gal_data_t`

¶`*input`

, gal_data_t`*bins`

, int`normalize`

, int`maxone`

) -
Make a histogram of all the elements in the given dataset with bin values that are defined in the

`bins`

structure (see`gal_statistics_regular_bins`

, they currently have to be equally spaced). The returned histogram is a 1-D`gal_data_t`

of type`GAL_TYPE_FLOAT32`

, with the same number of elements as`bins`

. For each bin, it will contain the number of input elements that fell inside of that bin.Let’s write the center of the \(i\)th element of the bin array as \(b_i\), and the fixed half-bin width as \(h\). Then element \(j\) of the input array (\(in_j\)) will be counted in \(b_i\) if \((b_i-h) \le in_j < (b_i+h)\). However, if \(in_j\) is somewhere in the last bin, the condition changes to \((b_i-h) \le in_j \le (b_i+h)\).

If

`normalize!=0`

, the histogram will be “normalized” such that the sum of the counts column will be one. In other words, all the counts in every bin will be divided by the total number of counts. If`maxone!=0`

, the histogram’s maximum count will be 1. In other words, the counts in every bin will be divided by the value of the maximum. In both of these cases, the output dataset will have a`GAL_DATA_FLOAT32`

datatype.

- Function:

`gal_data_t *`

**gal_statistics_histogram2d**`(gal_data_t`

¶`*input`

, gal_data_t`*bins`

) -
This function is very similar to

`gal_statistics_histogram`

, but will build a 2D histogram (count how many of the elements of`input`

are a within a 2D box. The bins comprising the first dimension of the 2D box are defined by`bins`

. The bins of the second dimension are defined by`bins->next`

(`bins`

is a List of`gal_data_t`

). Both the`bin`

and`bin->next`

can be created with`gal_statistics_regular_bins`

.This function returns a list of

`gal_data_t`

with three nodes/columns, so you can directly write them into a table (see Table input output (`table.h`)). Assuming`bins`

has \(N1\) bins and`bins->next`

has \(N2\) bins, each node/column of the returned output is a 1D array with \(N1\times N2\) elements. The first and second columns are the center of the 2D bin along the first and second dimensions and have a`double`

data type. The third column is the 2D histogram (the number of input elements that have a value within that 2D bin) and has a`uint32`

data type (see Numeric data types).

- Function:

`gal_data_t *`

**gal_statistics_cfp**`(gal_data_t`

¶`*input`

, gal_data_t`*bins`

, int`normalize`

) Make a cumulative frequency plot (CFP) of all the elements in

`input`

with bin values that are defined in the`bins`

structure (see`gal_statistics_regular_bins`

).The CFP is built from the histogram: in each bin, the value is the sum of all previous bins in the histogram. Thus, if you have already calculated the histogram before calling this function, you can pass it onto this function as the data structure in

`bins->next`

(see`List of gal_data_t`

). If`bin->next!=NULL`

, then it is assumed to be the histogram. If it is`NULL`

, then the histogram will be calculated internally and freed after the job is finished.When a histogram is given and it is normalized, the CFP will also be normalized (even if the normalized flag is not set here): note that a normalized CFP’s maximum value is 1.

- Function:

`gal_data_t *`

**gal_statistics_clip_sigma**`(gal_data_t`

¶`*input`

, float`multip`

, float`param`

, float`extrastats`

, int`inplace`

, int`quiet`

) Apply \(\sigma\)-clipping on a given dataset and return a dataset that contains the results. For a description of \(\sigma\)-clipping see Sigma clipping.

`multip`

is the multiple of the standard deviation (or \(\sigma\), that is used to define outliers in each round of clipping).The role of

`param`

is determined based on its value. If`param`

is larger than`1`

(one), it must be an integer and will be interpreted as the number clips to do. If it is less than`1`

(one), it is interpreted as the tolerance level to stop the iteration.The returned dataset (let’s call it

`out`

) contains a 6-element array with type`GAL_TYPE_FLOAT32`

. Through the`GAL_STATISTICS_CLIP_OUTCOL_*`

macros below, you can access any particular measurement.out=gal_statistics_clip_sigma(input, ....); float *array=out->array; array[ GAL_STATISTICS_CLIP_OUTCOL_NUMBER_USED ] array[ GAL_STATISTICS_CLIP_OUTCOL_MEAN ] array[ GAL_STATISTICS_CLIP_OUTCOL_STD ] array[ GAL_STATISTICS_CLIP_OUTCOL_MEDIAN ] array[ GAL_STATISTICS_CLIP_OUTCOL_MAD ] array[ GAL_STATISTICS_CLIP_OUTCOL_NUMBER_CLIPS ]

However, note that all are not measured by default! Since the mean and MAD are not necessary during sigma-clipping, if you want them, you have to set the following two bit flags in the

`extrastats`

argument as below.int extrastats=0; /* To initialize all bits */ /* If you want the sigma-clipped MAD. */ extrastats |= GAL_STATISTICS_CLIP_OUTCOL_OPTIONAL_MAD; /* If you want the sigma-clipped mean. */ extrastats |= GAL_STATISTICS_CLIP_OUTCOL_OPTIONAL_MEAN;

If the \(\sigma\)-clipping does not converge or all input elements are blank, then this function will return NaN values for all the elements above.

- Function:

`gal_data_t *`

**gal_statistics_clip_mad**`(gal_data_t`

¶`*input`

, float`multip`

, float`param`

, uint8_t`extrastats`

, int`inplace`

, int`quiet`

) Similar to

`gal_statistics_clip_sigma`

, but will do median absolute deviation (MAD) based clipping, see MAD clipping.The only difference is that for this function the MAD is automatically calculated during clipping. It is the mean and standard deviation that will not be calculated unless requested with the

`GAL_STATISTICS_CLIP_OUTCOL_OPTIONAL_MEAN`

and`GAL_STATISTICS_CLIP_OUTCOL_OPTIONAL_STD`

bit flats respectively.

- Function:

`gal_data_t *`

**gal_statistics_outlier_bydistance**`(int`

¶`pos1_neg0`

, gal_data_t`*input`

, size_t`window_size`

, float`sigma`

, float`sigclip_multip`

, float`sigclip_param`

, int`inplace`

, int`quiet`

) -
Find the first positive outlier (if

`pos1_neg0!=0`

) in the`input`

distribution. When`pos1_neg0==0`

, the same algorithm goes to the start of the dataset. The returned dataset contains a single element: the first positive outlier. It is one of the dataset’s elements, in the same type as the input. If the process fails for any reason (for example, no outlier was found), a`NULL`

pointer will be returned.All (possibly existing) blank elements are first removed from the input dataset, then it is sorted. A sliding window of

`window_size`

elements is parsed over the dataset. Starting from the`window_size`

-th element of the dataset, in the direction of increasing values. This window is used as a reference. The first element where the distance to the previous (sorted) element is`sigma`

units away from the distribution of distances in its window is considered an outlier and returned by this function.Formally, if we assume there are \(N\) non-blank elements. They are first sorted. Searching for the outlier starts on element \(W\). Let’s take \(v_i\) to be the \(i\)-th element of the sorted input (with no blank values) and \(m\) and \(\sigma\) as the \(\sigma\)-clipped median and standard deviation from the distances of the previous \(W\) elements (not including \(v_i\)). If the value given to

`sigma`

is displayed with \(s\), the \(i\)-th element is considered as an outlier when the condition below is true.$${(v_i-v_{i-1})-m\over \sigma}>s$$

The

`sigclip_multip`

and`sigclip_param`

arguments specify the properties of the \(\sigma\)-clipping (see Sigma clipping for more). You see that by this definition, the outlier cannot be any of the lower half elements. The advantage of this algorithm compared to \(\sigma\)-clippign is that it only looks backwards (in the sorted array) and parses it in one direction.If

`inplace!=0`

, the removing of blank elements and sorting will be done within the input dataset’s allocated space. Otherwise, this function will internally allocate (and later free) the necessary space to keep the intermediate space that this process requires.If

`quiet!=0`

, this function will report the parameters every time it moves the window as a separate line with several columns. The first column is the value, the second (in square brackets) is the sorted index, the third is the distance of this element from the previous one. The Fourth and fifth (in parenthesis) are the median and standard deviation of the \(\sigma\)-clipped distribution within the window and the last column is the difference between the third and fourth, divided by the fifth.

- Function:

`gal_data_t *`

**gal_statistics_outlier_flat_cfp**`(gal_data_t`

¶`*input`

, size_t`numprev`

, float`sigclip_multip`

, float`sigclip_param`

, float`thresh`

, size_t`numcontig`

, int`inplace`

, int`quiet`

, size_t`*index`

) -
Return the first element in the given dataset where the cumulative frequency plot first becomes significantly flat for a sufficient number of elements. The returned dataset only has one element (with the same type as the input). If

`index!=NULL`

, the index (counting from zero, after sorting the dataset and removing any blanks) is written in the space that`index`

points to. If no sufficiently flat portion is found, the returned pointer will be`NULL`

.The flatness on the cumulative frequency plot is defined like this (see Histogram and Cumulative Frequency Plot): on the sorted dataset, for every point (\(a_i\)), we calculate \(d_i=a_{i+2}-a_{i-2}\). This done on the first \(N\) elements (value of

`numprev`

). After element \(a_{N+2}\), we start estimating the flatness as follows: for every element we use the \(N\), \(d_i\) measurements before it as the reference. Let’s call this set \(D_i\) for element \(i\). The \(\sigma\)-clipped median (\(m\)) and standard deviation (\(s\)) of \(D_i\) are then calculated. The \(\sigma\)-clipping can be configured with the two`sigclip_param`

and`sigclip_multip`

arguments.Taking \(t\) as the significance threshold (value to

`thresh`

), a point is considered flat when \(a_i>m+t\sigma\). But a single point satisfying this condition will probably just be due to noise. To make a more robust estimate, this significance/condition has to hold for`numcontig`

contiguous elements after \(a_i\). When this is satisfied, \(a_i\) is returned as the point where the distribution’s cumulative frequency plot becomes flat.To get a good estimate of \(m\) and \(s\), it is thus recommended to set

`numprev`

as large as possible. However, be careful not to set it too high: the checks in the paragraph above are not done on the first`numprev`

elements and this function assumes the flatness occurs after them. Also, be sure that the value to`numcontig`

is much less than`numprev`

, otherwise \(\sigma\)-clipping may not be able to remove the immediate outliers in \(D_i\) near the boundary of the flat region.When

`quiet==0`

, the basic measurements done on each element are printed on the command-line (good for finding the best parameters). When`inplace!=0`

, the sorting and removal of blank elements is done on the input dataset, so the input may be altered after this function.

JavaScript license information

GNU Astronomy Utilities 0.22 manual, February 2024.