Print the number of all used (non-blank and in range) elements.
Print the minimum value of all used elements.
Print the maximum value of all used elements.
Print the sum of all used elements.
Print the mean (average) of all used elements.
Print the standard deviation of all used elements.
Print the median of all used elements.
Print the values at the given quantiles of the input dataset.
Any number of quantiles may be given and one number will be printed for each.
Values can either be written as a single number or as fractions, but must be between zero and one (inclusive).
Hence, in effect, --quantile=0.25 --quantile=0.75 is equivalent to --quantile=0.25,3/4, or -u1/4,3/4.
The returned value is one of the elements from the dataset. Taking \(q\) to be your desired quantile, and \(N\) to be the total number of used (non-blank and within the given range) elements, the returned value is at the following position in the sorted array: \(round(q\times{}N)\).
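The selection rule above can be sketched in a few lines of shell. This is only an illustration, not how Statistics is implemented: quantile_sketch is a hypothetical helper, and it assumes positions are counted from zero so that a quantile of 0 gives the minimum and a quantile of 1 gives the maximum (the convention used by the quantile-of-the-mean example later in this section). Under that assumption the index is scaled by \(N-1\) rather than \(N\), and the exact rounding may differ from what Statistics does. The helper only accepts decimal quantiles (not fractions like 3/4).

```shell
# Hypothetical helper (not part of Gnuastro): print the element at
# quantile Q (first argument, a decimal between 0 and 1) of the numbers
# given on standard input, one number per line.
quantile_sketch () {
    sort -n | awk -v q="$1" '
        { v[NR-1] = $1 }                   # sorted values, zero-based
        END {
            if (NR == 0) exit 1            # nothing to measure
            i = int(q * (NR - 1) + 0.5)    # round to the nearest index
            print v[i]
        }'
}

# The seven-element array used later in this section:
printf '%s\n' 1 2 20 4 5 6 3 | quantile_sketch 0.5    # prints: 4
```

The printed value is always one of the input elements, as described above; no interpolation between neighboring elements is attempted.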
Print the quantiles of the given values in the dataset. This option is the inverse of --quantile and operates similarly, except that the acceptable values are within the range of the dataset, not between 0 and 1. Formally it is known as the “Quantile function”.
Since the dataset is not continuous, this function will find the nearest element of the dataset and use its position to estimate the quantile function.
Print the quantile of the mean in the dataset. This is a very good measure for detecting skewness or outliers. The concept is used by programs like NoiseChisel to identify the presence of signal in a tile of the image (because signal in noise causes skewness).
For example, take this simple array: 1 2 20 4 5 6 3. The mean is approximately 5.86. The nearest element to this mean is 6, and the quantile of 6 in this distribution is 0.8333.
Here is how we got to this: in the sorted dataset (1 2 3 4 5 6 20), 6 is the 5-th element (counting from zero, since a quantile of zero corresponds to the minimum, by definition) and the maximum is the 6-th element (again, counting from zero).
So the quantile of the mean in this case is \(5/6=0.8333\).
In the example above, if we had 7 instead of 20 (which was an outlier), then the mean would be 4 and the quantile of the mean would be 0.5 (which, by definition, is the quantile of the median), showing no outliers.
As the number of elements increases, the mean itself is less affected by a small number of outliers, but skewness can be nicely identified by the quantile of the mean.
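The arithmetic of the example above can be reproduced with standard shell tools. The sketch below is only an illustration under the same conventions as the text (nearest element to the mean, positions counted from zero, quantile taken as position divided by the position of the maximum); quantile_of_mean_sketch is a hypothetical helper and is not how Statistics computes this measurement internally.

```shell
# Hypothetical helper (not part of Gnuastro): for the numbers on standard
# input (one per line), print the mean, the element nearest to the mean,
# and the quantile of that element.
quantile_of_mean_sketch () {
    sort -n | awk '
        { v[NR-1] = $1; sum += $1 }        # sorted values, zero-based
        END {
            if (NR < 2) exit 1             # need at least two elements
            mean = sum / NR
            best = 0                       # index of the nearest element
            for (i = 1; i < NR; i++) {
                d  = v[i] - mean;    if (d < 0)  d  = -d
                db = v[best] - mean; if (db < 0) db = -db
                if (d < db) best = i
            }
            printf "%.4f %s %.4f\n", mean, v[best], best / (NR - 1)
        }'
}

# With the outlier (20), and with 7 in its place:
printf '%s\n' 1 2 20 4 5 6 3 | quantile_of_mean_sketch  # prints: 5.8571 6 0.8333
printf '%s\n' 1 2 7 4 5 6 3  | quantile_of_mean_sketch  # prints: 4.0000 4 0.5000
```

The first call shows the quantile of the mean pushed far above 0.5 by a single outlier; the second shows it returning to 0.5 once the outlier is gone.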
Print the mode of all used elements. The mode is found through the mirror distribution, which is fully described in Appendix C of Akhlaghi and Ichikawa 2015.
This mode calculation algorithm is non-parametric, so it can fail when the dataset is not large enough (usually, it needs more than about 1000 elements) or does not have a clear mode.
In such cases, this option will return a value of nan (for the floating point NaN value).
As described in that paper, the easiest way to assess the quality of this mode calculation method is to use its symmetricity (see --modesym below). A better way would be to use the --mirror option to generate the histogram and cumulative frequency tables for any given mirror value (the mode in this case). If you generate plots like those shown in Figure 21 of that paper, then your mode is accurate.
Print the quantile of the mode. You can get the actual mode value from the --mode option described above. In many cases, the absolute value of the mode is irrelevant, but its position within the distribution is important. In such cases, this option comes in handy.
Print the symmetricity of the calculated mode. See the description of --mode for more. This mode algorithm finds the mode based on how symmetric it is, so if the symmetricity returned by this option is too low, the mode is not very accurate. See Appendix C of Akhlaghi and Ichikawa 2015 for a full description. In practice, symmetricity values larger than 0.2 are mostly good.
Print the value in the distribution where the mirror and input distributions are no longer symmetric; see --mode and Appendix C of Akhlaghi and Ichikawa 2015 for more.
Number of elements after applying \(\sigma\)-clipping (see Sigma clipping). \(\sigma\)-clipping configuration is done with the --sclipparams option.
Median after applying \(\sigma\)-clipping (see Sigma clipping). \(\sigma\)-clipping configuration is done with the --sclipparams option.
Here is one scenario where this can be useful: assume you have a table and you would like to remove the rows that are outliers (not within the \(\sigma\)-clipping range).
Let’s assume your table is called table.fits and you only want to keep the rows that have a value in COLUMN within the \(\sigma\)-clipped range (to \(3\sigma\), with a tolerance of 0.1).
This command will return the \(\sigma\)-clipped median and standard deviation (used to define the range later).
$ aststatistics table.fits -cCOLUMN --sclipparams=3,0.1 \
                --sigclip-median --sigclip-std
You can then use the --range option of Table (see Table) to select the proper rows. But for that, you need the actual starting and ending values of the range (\(m\pm s\sigma\); where \(m\) is the median and \(s\) is the multiple of sigma to define an outlier). Therefore, the raw outputs of Statistics in the command above are not enough.
To get the starting and ending values of the non-outlier range (and put a ‘,’ between them, ready to be used in --range), pipe the result into AWK.
But in AWK, we will also need the multiple of \(\sigma\), so we will define it as a shell variable (s) before calling Statistics (note how $s is used two times now):
$ s=3
$ aststatistics table.fits -cCOLUMN --sclipparams=$s,0.1 \
                --sigclip-median --sigclip-std \
     | awk '{s='$s'; printf("%f,%f\n", $1-s*$2, $1+s*$2)}'
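To see what the AWK step does on its own, here is a minimal sketch that feeds it made-up numbers in place of the output of Statistics. The helper name clip_range_sketch and the input values (a median of 10 and a standard deviation of 2) are hypothetical and only for illustration. The sketch passes the multiple of \(\sigma\) through AWK's standard -v option instead of splicing the shell variable between quotes; for a simple number like this, both approaches give the same result.

```shell
# Hypothetical helper (not part of Gnuastro): read one line containing
# "median std" on standard input and print "low,high", where the
# clipping multiple is given as the first argument.
clip_range_sketch () {
    awk -v s="$1" '{ printf("%f,%f\n", $1 - s*$2, $1 + s*$2) }'
}

# Made-up numbers standing in for the two values printed by Statistics:
echo "10 2" | clip_range_sketch 3      # prints: 4.000000,16.000000
```

The printed string has the comma-separated form that the --range option of Table expects after the column name.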
To pass it onto Table, we will need to keep the printed output from the command above in another shell variable (r), not print it.
In Bash, you can do this by putting the whole statement within a $():
$ s=3
$ r=$(aststatistics table.fits -cCOLUMN --sclipparams=$s,0.1 \
                    --sigclip-median --sigclip-std \
         | awk '{s='$s'; printf("%f,%f\n", $1-s*$2, $1+s*$2)}')
$ echo $r      # Just to confirm.
Now you can use Table with the --range option to only print the rows that have a value in COLUMN within the desired range:
$ asttable table.fits --range=COLUMN,$r
To save the resulting table (that is clean of outliers) in another file (for example, named cleaned.fits; it can also have a .txt suffix), just add --output=cleaned.fits to the command above.
Mean after applying \(\sigma\)-clipping (see Sigma clipping). \(\sigma\)-clipping configuration is done with the --sclipparams option.
Standard deviation after applying \(\sigma\)-clipping (see Sigma clipping). \(\sigma\)-clipping configuration is done with the --sclipparams option.
GNU Astronomy Utilities 0.20 manual, April 2023.