Single value measurements (GNU Astronomy Utilities)

7.1.5.2 Single value measurements

-n

--number

Print the number of all used (non-blank and in range) elements.

--minimum

Print the minimum value of all used elements.

--maximum

Print the maximum value of all used elements.

--sum

Print the sum of all used elements.

-m

--mean

Print the mean (average) of all used elements.

-t

--std

Print the standard deviation of all used elements.

--mad

Print the median absolute deviation (MAD) of all used elements.
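
As a reminder of what this statistic measures, the following Python sketch (not part of Gnuastro) computes the MAD as the median of the absolute differences from the median. The function name `mad` is hypothetical, and the sketch assumes that no scaling factor is applied to the result.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

# A single strong outlier (100) barely affects the MAD.
print(mad([1, 2, 3, 4, 100]))
```

Because it is built on medians, the MAD is much less sensitive to outliers than the standard deviation.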

-E

--median

Print the median of all used elements.

-u FLT[,FLT[,...]]

--quantile=FLT[,FLT[,...]]

Print the values at the given quantiles of the input dataset. Any number of quantiles may be given and one number will be printed for each. Each value can be written either as a single number or as a fraction, but must be between zero and one (inclusive). Hence, in effect --quantile=0.25 --quantile=0.75 is equivalent to --quantile=0.25,3/4, or -u1/4,3/4.

The returned value is one of the elements from the dataset. Taking $q$ to be your desired quantile, and $N$ to be the total number of used (non-blank and within the given range) elements, the returned value is at the following position in the sorted array: $\mathrm{round}(q\times N)$.
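
The Python sketch below (not Gnuastro code) is a literal reading of the two paragraphs above: it parses a value like `0.25,3/4` and picks the element at the rounded position. The function names are hypothetical. Counting positions from zero and clamping the position to the last element (so that a quantile of one returns the maximum) are assumptions of this sketch, not statements about how Statistics is implemented.

```python
import math
from fractions import Fraction

def parse_quantiles(text):
    """Parse a value like "0.25,3/4" into a list of floats in [0, 1]."""
    quantiles = [float(Fraction(item)) for item in text.split(",")]
    for q in quantiles:
        if not 0.0 <= q <= 1.0:
            raise ValueError(f"quantile {q} is not between 0 and 1")
    return quantiles

def value_at_quantile(values, q):
    """Return the element at position round(q*N) of the sorted values."""
    data = sorted(values)
    n = len(data)
    # Round half up (Python's built-in round() rounds halves to even).
    position = math.floor(q * n + 0.5)
    # Assumption: zero-based positions, clamped to the last element.
    position = min(position, n - 1)
    return data[position]

for q in parse_quantiles("1/4,3/4"):
    print(q, value_at_quantile([40, 10, 30, 20], q))
```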

--quantfunc=FLT[,FLT[,...]]

Print the quantiles of the given values in the dataset. This option is the inverse of --quantile and operates similarly, except that the acceptable values are within the range of the dataset, not between 0 and 1. Formally it is known as the “Quantile function”.

Since the dataset is not continuous, this function will find the nearest element of the dataset and use its position to estimate the quantile function.

--quantofmean

Print the quantile of the mean in the dataset. This is a very good measure for detecting skewness or outliers. The concept is used by programs like NoiseChisel to identify the presence of signal in a tile of the image (because signal in noise causes skewness).

For example, take this simple array: 1 2 20 4 5 6 3. The mean is approximately 5.86. The nearest element to this mean is 6 and the quantile of 6 in this distribution is 0.8333. Here is how we got to this: in the sorted dataset (1 2 3 4 5 6 20), 6 is the 5-th element (counting from zero, since a quantile of zero corresponds to the minimum, by definition) and the maximum is the 6-th element (again, counting from zero). So the quantile of the mean in this case is $5/6=0.8333$.

In the example above, if we had 7 instead of 20 (which was an outlier), then the mean would be 4 and the quantile of the mean would be 0.5 (which by definition, is the quantile of the median), showing no outliers. As the number of elements increases, the mean itself is less affected by a small number of outliers, but skewness can be nicely identified by the quantile of the mean.
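
The two worked examples above can be reproduced with the short Python sketch below (not Gnuastro code). It follows the steps as described: find the element nearest to the mean in the sorted dataset and divide its zero-based position by the position of the maximum. The function names are hypothetical, ties are resolved in favor of the first nearest element (an assumption), and at least two elements are assumed.

```python
def quantile_of_value(values, target):
    """Quantile of the element nearest to target (zero-based positions)."""
    data = sorted(values)
    # Position of the nearest element; the first one wins on ties.
    nearest = min(range(len(data)), key=lambda i: abs(data[i] - target))
    # The maximum sits at position N-1, which corresponds to a quantile of 1.
    return nearest / (len(data) - 1)

def quantile_of_mean(values):
    """Quantile of the mean, using the nearest element of the dataset."""
    return quantile_of_value(values, sum(values) / len(values))

print(quantile_of_mean([1, 2, 20, 4, 5, 6, 3]))  # with the outlier (20)
print(quantile_of_mean([1, 2, 7, 4, 5, 6, 3]))   # outlier replaced by 7
```

The first call gives $5/6$ and the second gives 0.5, matching the values derived in the text.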

-O

--mode

Print the mode of all used elements. The mode is found through the mirror distribution, which is fully described in Appendix C of Akhlaghi and Ichikawa 2015.

This mode calculation algorithm is non-parametric, so it can fail when the dataset is not large enough (it usually needs more than about 1000 elements) or does not have a clear mode. In such cases, this option will return a value of nan (for the floating point NaN value).

As described in that paper, the easiest way to assess the quality of this mode calculation method is to use its symmetricity (see --modesym below). A better way would be to use the --mirror option to generate the histogram and cumulative frequency tables for any given mirror value (the mode in this case) as a table. If you generate plots like those shown in Figure 21 of that paper, then your mode is accurate.

--modequant

Print the quantile of the mode. You can get the actual mode value from the --mode option described above. In many cases, the absolute value of the mode is irrelevant, but its position within the distribution is important. In such cases, this option comes in handy.

--modesym

Print the symmetricity of the calculated mode. See the description of --mode for more. This mode algorithm finds the mode based on how symmetric it is, so if the symmetricity returned by this option is too low, the mode is not very accurate. See Appendix C of Akhlaghi and Ichikawa 2015 for a full description. In practice, symmetricity values larger than 0.2 are mostly good.

--modesymvalue

Print the value in the distribution where the mirror and input distributions are no longer symmetric, see --mode and Appendix C of Akhlaghi and Ichikawa 2015 for more.

--sigclip-std

--sigclip-mad

--sigclip-mean

--sigclip-number

--sigclip-median

Calculate the desired statistic after applying $\sigma$-clipping (see Sigma clipping, part of the tutorial Clipping outliers). $\sigma$-clipping configuration is done with the --sclipparams option.

Here is one scenario where this can be useful: assume you have a table and you would like to remove the rows that are outliers (not within the $\sigma$-clipping range). Let’s assume your table is called table.fits and you only want to keep the rows that have a value in COLUMN within the $\sigma$-clipped range (to $3\sigma$, with a tolerance of 0.1). This command will return the $\sigma$-clipped median and standard deviation (used to define the range later).

$ aststatistics table.fits -cCOLUMN --sclipparams=3,0.1 \
                --sigclip-median --sigclip-std

You can then use the --range option of Table (see Table) to select the proper rows. But for that, you need the actual starting and ending values of the range ($m\pm s\sigma$; where $m$ is the median and $s$ is the multiple of sigma to define an outlier). Therefore, the raw outputs of Statistics in the command above are not enough.

To get the starting and ending values of the non-outlier range (and put a ‘,’ between them, ready to be used in --range), pipe the result into AWK. But in AWK, we will also need the multiple of $\sigma$, so we will define it as a shell variable (s) before calling Statistics (note how $s is used two times now):

$ s=3
$ aststatistics table.fits -cCOLUMN --sclipparams=$s,0.1 \
                --sigclip-median --sigclip-std           \
     | awk '{s='$s'; printf("%f,%f\n", $1-s*$2, $1+s*$2)}'

To pass it on to Table, we will need to keep the printed output from the command above in another shell variable (r), instead of printing it. In Bash, you can do this by putting the whole statement within a $():

$ s=3
$ r=$(aststatistics table.fits -cCOLUMN --sclipparams=$s,0.1 \
                    --sigclip-median --sigclip-std           \
        | awk '{s='$s'; printf("%f,%f\n", $1-s*$2, $1+s*$2)}')
$ echo $r      # Just to confirm.

Now you can use Table with the --range option to only print the rows that have a value in COLUMN within the desired range:

$ asttable table.fits --range=COLUMN,$r

To save the resulting table (that is clean of outliers) in another file (for example, named cleaned.fits, it can also have a .txt suffix), just add --output=cleaned.fits to the command above.
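
To make the meaning of the printed range more concrete, here is a minimal Python sketch (not Gnuastro code) of iterative $\sigma$-clipping on a list of numbers. The clipping algorithm itself is described in the Sigma clipping tutorial; this sketch assumes that clipping stops once the relative change in the standard deviation falls below the tolerance, and that the population standard deviation is used. The function name `sigma_clip`, the `max_iter` safety limit and the sample values are hypothetical.

```python
import statistics

def sigma_clip(values, multiple=3.0, tolerance=0.1, max_iter=50):
    """Return (median, std, kept) after iterative sigma-clipping."""
    kept = list(values)
    previous_std = None
    for _ in range(max_iter):
        median = statistics.median(kept)
        std = statistics.pstdev(kept)  # assumption: population std
        # Stop once the standard deviation has converged (assumed rule).
        if previous_std is not None:
            if std == 0 or (previous_std - std) / std < tolerance:
                break
        previous_std = std
        # Keep only the elements within median +/- multiple*std.
        low, high = median - multiple * std, median + multiple * std
        kept = [v for v in kept if low <= v <= high]
    return median, std, kept

column = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 1000]
median, std, kept = sigma_clip(column)
# Same formatting as the AWK printf above: "start,end" of the range.
print(f"{median - 3 * std:f},{median + 3 * std:f}")
```

The printed pair plays the same role as the r shell variable above: every value outside of it (here, only 1000) is treated as an outlier.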

--madclip-std

--madclip-mad

--madclip-mean

--madclip-number

--madclip-median

Calculate the desired statistic after applying median absolute deviation (MAD) clipping (see MAD clipping, part of the tutorial Clipping outliers). MAD-clipping configuration is done with the --mclipparams option.

These options behave similarly to the --sigclip-* options; read their description for usage examples.

--concentration=FLT[,FLT[,...]]

Return the “concentration” around the median (see rest of this description for the definition); the input value(s) are the quantile width(s) where it is measured.

For a uniform distribution, the output of this operation will be approximately $1.0$. When the density of values is higher around the median, the output will be larger: it is larger than one for a Gaussian distribution, and even larger for distributions that are more concentrated than a Gaussian.

This is the algorithm used to measure this value:

1. Sort the input dataset and remove all blank values. If one or fewer non-blank values remain, return NaN.
2. Select the second and second-to-last elements of the sorted array as the minimum and maximum, respectively. The first and last elements are not used because they are affected too strongly by scatter.
3. Subtract the minimum from each element, and divide the result by the difference between the maximum and minimum. After this operation, the input’s values²⁰¹ will be between 0 and 1. This scaling does not change the order of the input elements; instead, each element’s value now shows its position relative to the range of the whole distribution’s values (the minimum and maximum values above).
4. Calculate the scaled values corresponding to the quantiles that are defined by the given width. For example, if the given width (value to this option) is 0.2, the scaled values at the quantiles $0.5-(0.2/2)=0.4$ and $0.5+(0.2/2)=0.6$ will be measured.
5. Divide the width by the difference between these two scaled values, and return the result as the concentration.

In a uniform distribution, the scaling step will convert each input into its quantile: the spacing between scaled values will be uniform. As a result, the difference between the quantiles measured around the median will be equal to the input width and the result will be approximately $1.0$. However, if the distribution is concentrated around the median, the spacing between the scaled values will be much less around the median and the quantile difference will be less than the width. Therefore, when we divide the width by the quantile difference, the value will be larger than one.
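
The steps of the algorithm can be sketched in Python as below (an illustration, not the Gnuastro implementation). The function name `concentration` is hypothetical; blank values are represented by NaN; the quantile position is taken as $\mathrm{round}(q\times N)$, counted from zero and clamped to the last element (an assumption); and returning NaN when the selected minimum and maximum coincide is also an assumption. The two test datasets are idealized, evenly-sampled uniform and Gaussian distributions.

```python
import math
from statistics import NormalDist

def concentration(values, width):
    # Sort the input and remove blank (NaN) values.
    data = sorted(v for v in values if not math.isnan(v))
    n = len(data)
    if n < 2:
        return math.nan
    # Second and second-to-last elements act as minimum and maximum.
    low, high = data[1], data[-2]
    if high <= low:  # assumption: too few distinct values to scale
        return math.nan
    # Scale the values so the chosen minimum is 0 and the maximum is 1.
    scaled = [(v - low) / (high - low) for v in data]

    def at_quantile(q):
        # Assumption: zero-based position round(q*N), clamped to N-1.
        return scaled[min(n - 1, math.floor(q * n + 0.5))]

    # Scaled values at the quantiles 0.5 -/+ width/2.
    # This sketch assumes the two scaled values differ.
    difference = at_quantile(0.5 + width / 2) - at_quantile(0.5 - width / 2)
    return width / difference

n = 1001
uniform = [i / (n - 1) for i in range(n)]
gaussian = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
print(concentration(uniform, 0.2), concentration(gaussian, 0.2))
```

The uniform dataset gives a value close to one, while the Gaussian dataset gives a clearly larger value. The exact Gaussian value depends on the number of elements, because the range between the second and second-to-last elements grows with the sample size.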

The example commands below create two randomly distributed “noisy” images, one with a Gaussian distribution and one with a uniform distribution. We will then run this option on both to see the different concentrations²⁰². See Generating histograms and cumulative freq. on how you can generate the histogram of these two images on the command-line to visualize the distribution.

$ astarithmetic 1000 1000 2 makenew 10 mknoise-sigma \
                --output=gaussian.fits

$ astarithmetic 1000 1000 2 makenew 10 mknoise-uniform \
                --output=uniform.fits

$ aststatistics gaussian.fits --concentration=0.25
3.71347573489440e+00

$ aststatistics uniform.fits  --concentration=0.25
9.99988794452348e-01

Note that this option is primarily designed for symmetric distributions, not skewed ones (where the mode and median will be distant). Here, we define the “center” in “concentration” as the median, not the mode. To check if the distribution is symmetric (that the mode and median are similar), you can use the --quantofmean option described above. Recall that you can call all the options in this section in one call to the Statistics program like below:

$ aststatistics gaussian.fits \
                --quantofmean --concentration=0.25
5.00260500260500e-01    3.71347573489440e+00

From the quantile-of-mean value of approximately 0.5, we see that the distribution is symmetric and from the concentration, we see that it is not a uniform one.

Footnotes

(201)

Technically, the second sorted value will be 0 and the second-to-last value will be 1.

(202)

The values you get will be slightly different because of the different random seeds. To get a reproducible result, see Generating random numbers.