Previous: Sky value, Up: Statistics [Contents][Index]

Statistics will print statistical measures of an input dataset (table
column or image). The executable name is `aststatistics` with the
following general template

$ aststatistics [OPTION ...] InputImage.fits

One line examples:

## Print some general statistics of input image: $ aststatistics image.fits ## Print some general statistics of column named MAG_F160W: $ aststatistics catalog.fits -h1 --column=MAG_F160W ## Make the histogram of the column named MAG_F160W: $ aststatistics table.fits -cMAG_F160W --histogram ## Find the Sky value on image with a given kernel: $ aststatistics image.fits --sky --kernel=kernel.fits ## Print Sigma-clipped results of records with a MAG_F160W ## column value between 26 and 27: $ aststatistics cat.fits -cMAG_F160W -g26 -l27 --sigmaclip=3,0.2 ## Print the median value of all records in column MAG_F160W that ## have a value larger than 3 in column PHOTO_Z: $ aststatistics tab.txt -rPHOTO_Z -g3 -cMAG_F160W --median

An input image or table is necessary when processing is to be done. If any
output file is to be created, the value to the `--output` option, is
used as the base name for the generated files. Without `--output`,
the input name will be used to generate an output name, see Automatic output. The options described below are particular to Statistics, but for
general operations, it shares a large collection of options with the other
Gnuastro programs, see Common options for the full list. Options can
also be given in configuration files, for more, please see
Configuration files.

The input dataset may have blank values (see Blank pixels), in this
case, all blank pixels are ignored during the calculation. Initially, the
full dataset will be read, but it is possible to select a specific range of
data elements to use in the analysis of each run. You can either directly
specify a minimum and maximum value for the range of data elements to use
(with `--greaterequal` or `--lessthan`), or specify the range
using quantiles (with `--qrange`). If a range is specified, all
pixels outside of it are ignored before any processing.

The following set of options are for specifying the input/outputs of Statistics. There are many other input/output options that are common to all Gnuastro programs including Statistics, see Input/Output options for those.

`-c STR/INT``--column=STR/INT`The input column selector when the input file is a table. See Selecting table columns for a full description of how to use this option. For more on how tables are read in Gnuastro, please see Tables.

`-r STR/INT``--refcol=STR/INT`The reference column selector when the input file is a table. When a reference column is given, the range options below will be applied to this column and only elements in the input column that have a reference value in the correct range will be used. In practice this option allows you to select a subset of the input column based on values in another (the reference) column. All the statistical calculations will be done on the selected input column, not the reference column.

`-g FLT``--greaterequal=FLT`Limit the range of inputs into those with values greater and equal to what is given to this option. None of the values below this value will be used in any of the processing steps below.

`-l FLT``--lessthan=FLT`Limit the range of inputs into those with values less-than what is given to this option. None of the values greater or equal to this value will be used in any of the processing steps below.

`-Q FLT[,FLT]``--qrange=FLT[,FLT]`Specify the range of usable inputs using the quantile. This option can take one or two quantiles to specify the range. When only one number is input (let’s call it \(Q\)), the range will be those values in the quantile range \(Q\) to \(1-Q\). So when only one value is given, it must be less than 0.5. When two values are given, the first is used as the lower quantile range and the second is used as the larger quantile range.

The quantile of a given element in a dataset is defined by the fraction of its index to the total number of values in the sorted input array. So the smallest and largest values in the dataset have a quantile of 0.0 and 1.0. The quantile is a very useful non-parametric (making no assumptions about the input) relative measure to specify a range. It can best be understood in terms of the cumulative frequency plot, see Histogram and Cumulative Frequency Plot. The quantile of each horizontal axis value in the cumulative frequency plot is the vertical axis value associate with it.

When no operation is requested, Statistics will print some general basic
properties of the input dataset on the command-line like the example below
(ran on one of the output images of `make check`

^{100}). This default behavior is
designed to help give you a general feeling of how the data are distributed
and help in narrowing down your analysis.

$ aststatistics convolve_spatial_scaled_noised.fits \ --greaterequal=9500 --lessthan=11000 Statistics (GNU Astronomy Utilities) X.X ------- Input: convolve_spatial_scaled_noised.fits (hdu: 0) Range: from (inclusive) 9500, upto (exclusive) 11000. Unit: Brightness ------- Number of elements: 9074 Minimum: 9622.35 Maximum: 10999.7 Mode: 10055.45996 Mode quantile: 0.4001983908 Median: 10093.7 Mean: 10143.98257 Standard deviation: 221.80834 ------- Histogram: | ** | ****** | ******* | ********* | ************* | ************** | ****************** | ******************** | *************************** * | ***************************************** *** |* ************************************************************** |-----------------------------------------------------------------

Gnuastro’s Statistics is a very general purpose program, so to be able to easily understand this diversity in its operations (and how to possibly run them together), we’ll divided the operations into two types: those that don’t respect the position of the elements and those that do (by tessellating the input on a tile grid, see Tessellation). The former treat the whole dataset as one and can re-arrange all the elements (for example sort them), but the former do their processing on each tile independently. First, we’ll review the operations that work on the whole dataset.

The group of options below can used to get a single value measurement of
the whole dataset. They will print only the requested value as one field in
a line/row, like the `--mean`, `--median` options. These
options can be called any number of times and in any order. The outputs of
all such options will be printed on one line following each other (with a
space character between them). This feature makes these options very useful
in scripts, or to redirect into programs like GNU AWK for higher-level
processing. These are some of the most basic measures, Gnuastro is still
under heavy development and this list will grow. If you want another
statistical parameter, please contact us and we will do out best to add it
to this list, see Suggest new feature.

`-n``--number`Print the number of all used (non-blank and in range) elements.

`--minimum`Print the minimum value of all used elements.

`--maximum`Print the maximum value of all used elements.

`--sum`Print the sum of all used elements.

`-m``--mean`Print the mean (average) of all used elements.

`-t``--std`Print the standard deviation of all used elements.

`-E``--median`Print the median of all used elements.

`-u FLT[,FLT[,...]]``--quantile=FLT[,FLT[,...]]`Print the values at the given quantiles of the input dataset. Any number of quantiles may be given and one number will be printed for each. Values can either be written as a single number or as fractions, but must be between zero and one (inclusive). Hence, in effect

`--quantile=0.25 --quantile=0.75`

is equivalent to`--quantile=0.25,3/4`, or`-u1/4,3/4`.The returned value is one of the elements from the dataset. Taking \(q\) to be your desired quantile, and \(N\) to be the total number of used (non-blank and within the given range) elements, the returned value is at the following position in the sorted array: \(round(q\times{}N\)).

`--quantfunc=FLT[,FLT[,...]]`Print the quantiles of the given values in the dataset. This option is the inverse of the

`--quantile`and operates similarly except that the acceptable values are within the range of the dataset, not between 0 and 1. Formally it is known as the “Quantile function”.Since the dataset is not continuous this function will find the nearest element of the dataset and use its position to estimate the quantile function.

`-O``--mode`Print the mode of all used elements. The mode is found through the mirror distribution which is fully described in Appendix C of Akhlaghi and Ichikawa 2015. See that section for a full description.

This mode calculation algorithm is non-parametric, so when the dataset is not large enough (larger than about 1000 elements usually), or doesn’t have a clear mode it can fail. In such cases, this option will return a value of

`nan`

(for the floating point NaN value).As described in that paper, the easiest way to assess the quality of this mode calculation method is to use it’s symmetricity (see

`--modesym`below). A better way would be to use the`--mirror`option to generate the histogram and cumulative frequency tables for any given mirror value (the mode in this case) as a table. If you generate plots like those shown in Figure 21 of that paper, then your mode is accurate.`--modequant`Print the quantile of the mode. You can get the actual mode value from the

`--mode`described above. In many cases, the absolute value of the mode is irrelevant, but its position within the distribution is important. In such cases, this option will become handy.`--modesym`Print the symmetricity of the calculated mode. See the description of

`--mode`for more. This mode algorithm finds the mode based on how symmetric it is, so if the symmetricity returned by this option is too low, the mode is not too accurate. See Appendix C of Akhlaghi and Ichikawa 2015 for a full description. In practice, symmetricity values larger than 0.2 are mostly good.`--modesymvalue`Print the value in the distribution where the mirror and input distributions are no longer symmetric, see

`--mode`and Appendix C of Akhlaghi and Ichikawa 2015 for more.

The list of options below are for those statistical operations that output more than one value. So while they can be called together in one run, their outputs will be distinct (each one’s output will usually be printed in more than one line).

`-A``--asciihist`Print an ASCII histogram of the usable values within the input dataset along with some basic information like the example below (from the UVUDF catalog

^{101}). The width and height of the histogram (in units of character widths and heights on your command-line terminal) can be set with the`--numasciibins`(for the width) and`--asciiheight`options.For a full description of the histogram, please see Histogram and Cumulative Frequency Plot. An ASCII plot is certainly very crude and cannot be used in any publication, but it is very useful for getting a general feeling of the input dataset very fast and easily on the command-line without having to take your hands off the keyboard (which is a major distraction!). If you want to try it out, you can write it all in one line and ignore the

`\`and extra spaces.$ aststatistics uvudf_rafelski_2015.fits.gz --hdu=1 \ --column=MAG_F160W --lessthan=40 \ --asciihist --numasciibins=55 ASCII Histogram: Number: 8593 Y: (linear: 0 to 660) X: (linear: 17.7735 -- 31.4679, in 55 bins) | **** | ***** | ****** | ******** | ********* | *********** | ************** | ***************** | *********************** | ******************************** |*** *************************************************** |-------------------------------------------------------

`--asciicfp`Print the cumulative frequency plot of the usable elements in the input dataset. Please see descriptions under

`--asciihist`for more, the example below is from the same input table as that example. To better understand the cumulative frequency plot, please see Histogram and Cumulative Frequency Plot.$ aststatistics uvudf_rafelski_2015.fits.gz --hdu=1 \ --column=MAG_F160W --lessthan=40 \ --asciicfp --numasciibins=55 ASCII Cumulative frequency plot: Y: (linear: 0 to 8593) X: (linear: 17.7735 -- 31.4679, in 55 bins) | ******* | ********** | *********** | ************* | ************** | *************** | ***************** | ******************* | *********************** | ****************************** |******************************************************* |-------------------------------------------------------

`-H``--histogram`Save the histogram of the usable values in the input dataset into a table. The first column is the value at the center of the bin and the second is the number of points in that bin. If the

`--cumulative`option is also called with this option in a run, then the table will have three columns (the third is the cumulative frequency plot). Through the`--numbins`and`--lowerbin`you can modify the first column values and with`--normalize`and`--maxbinone`you can modify the second columns. See below for the description of each.By default (when no

`--output`is specified) a plain text table will be created, see Gnuastro text table format. If a FITS name is specified, you can use the common option`--tableformat`to have it as a FITS ASCII or FITS binary format, see Common options. This table can then be fed into your favorite plotting tool and get a much more clean and nice histogram than what the raw command-line can offer you (with the`--asciihist`option).`-C``--cumulative`Save the cumulative frequency plot of the usable values in the input dataset into a table, similar to

`--histogram`.`-s``--sigmaclip`Do \(\sigma\)-clipping on the usable pixels of the input dataset. See Sigma clipping for a full description on \(\sigma\)-clipping and also to better understand this option. The \(\sigma\)-clipping parameters can be set through the

`--sclipparams`option (see below).`--mirror=FLT`Make a histogram and cumulative frequency plot of the mirror distribution for the given dataset when the mirror is located at the value to this option. The mirror distribution is fully described in Appendix C of Akhlaghi and Ichikawa 2015 and currently it is only used to calculate the mode (see

`--mode`).Just note that the mirror distribution is a discrete distribution like the input, so while you may give any number as the value to this option, the actual mirror value is the closest number in the input dataset to this value. If the two numbers are different, Statistics will warn you of the actual mirror value used.

This option will make a table as output. Depending on your selected name for the output, it will be either a FITS table or a plain text table (which is the default). It contains three columns: the first is the center of the bins, the second is the histogram (with the largest value set to 1) and the third is the normalized cumulative frequency plot of the mirror distribution. The bins will be positioned such that the mode is on the starting interval of one of the bins to make it symmetric around the mirror. With this output file and the input histogram (that you can generate in another run of Statistics, using the

`--onebinvalue`), it is possible to make plots like Figure 21 of Akhlaghi and Ichikawa 2015.

The list of options below allow customization of the histogram and
cumulative frequency plots (for the `--histogram`,
`--cumulative`, `--asciihist`, and `--asciicfp`
options).

`--numbins`The number of bins (rows) to use in the histogram and the cumulative frequency plot tables (outputs of

`--histogram`and`--cumulative`).`--numasciibins`The number of bins (characters) to use in the ASCII plots when printing the histogram and the cumulative frequency plot (outputs of

`--asciihist`and`--asciicfp`).`--asciiheight`The number of lines to use when printing the ASCII histogram and cumulative frequency plot on the command-line (outputs of

`--asciihist`and`--asciicfp`).`-n``--normalize`Normalize the histogram or cumulative frequency plot tables (outputs of

`--histogram`and`--cumulative`). For a histogram, the sum of all bins will become one and for a cumulative frequency plot the last bin value will be one.`--maxbinone`Divide all the histogram values by the maximum bin value so it becomes one and the rest are similarly scaled. In some situations (for example if you want to plot the histogram and cumulative frequency plot in one plot) this can be very useful.

`--onebinstart=FLT`Make sure that one bin starts with the value to this option. In practice, this will shift the bins used to find the histogram and cumulative frequency plot such that one bin’s lower interval becomes this value. For example when the histogram range includes negative and positive values and zero has a special significance in your analysis, then zero will be somewhere in one bin and will mix counts of positive and negative. By setting

`--onebinstart=0`, you can make sure that the viewers of the histogram will not be confused without doing the math of setting a range and number of bins.Note that by default, the first row of the histogram and cumulative frequency plot show the central values of each bin. So in the example above you will not see the 0.000 in the first column, you will see two symmetric values. If the value is not within the usable input range, this option will be ignored.

All the options described until now were from the first class of operations discussed above: those that treat the whole dataset as one. However. It often happens that the relative position of the dataset elements over the dataset is significant. For example you don’t want one median value for the whole input image, you want to know how the median changes over the image. For such operations, the input has to be tessellated (see Tessellation). Thus this class of options can’t currently be called along with the options above in one run of Statistics.

`-t``--ontile`Do the respective single-valued calculation over one tile of the input dataset, not the whole dataset. This option must be called with at least one of the single valued options discussed above (for example

`--mean`or`--quantile`). The output will be a file in the same format as the input. If the`--oneelempertile`option is called, then one element/pixel will be used for each tile (see Processing options). Otherwise, the output will have the same size as the input, but each element will have the value corresponding to that tile’s value. If multiple single valued operations are called, then for each operation there will be one extension in the output FITS file.`-y``--sky`Estimate the Sky value on each tile as fully described in Quantifying signal in a tile. As described in that section, several options are necessary to configure the Sky estimation which are listed below. The output file will have two extensions: the first is the Sky value and the second is the Sky standard deviation on each tile. Similar to

`--ontile`, if the`--oneelempertile`option is called, then one element/pixel will be used for each tile (see Processing options).

The parameters for estimating the sky value can be set with the following
options, except for the `--sclipparams` option (which is also used
by the `--sigmaclip`), the rest are only used for the Sky value
estimation.

`-k=STR``--kernel=STR`File name of kernel to help in estimating the significance of signal in a tile, see Quantifying signal in a tile.

`--khdu=STR`Kernel HDU to help in estimating the significance of signal in a tile, see Quantifying signal in a tile.

`--mirrordist=FLT`Maximum distance (as a multiple of error) to estimate the difference between the input and mirror distributions in finding the mode, see Appendix C of Akhlaghi and Ichikawa 2015, also see Quantifying signal in a tile.

`--modmedqdiff=FLT`The maximum acceptable distance between the mode and median, see Quantifying signal in a tile.

`--sclipparams=FLT,FLT`The \(\sigma\)-clipping parameters, see Sigma clipping. This option takes two values which are separated by a comma (

`,`). Each value can either be written as a single number or as a fraction of two numbers (for example`3,1/10`

). The first value to this option is the multiple of \(\sigma\) that will be clipped (\(\alpha\) in that section). The second value is the exit criteria. If it is less than 1, then it is interpreted as tolerance and if it is larger than one it is a specific number. Hence, in the latter case the value must be an integer.`--smoothwidth=INT`Width of a flat kernel to convolve the interpolated tile values. Tile interpolation is done using the median of the

`--interpnumngb`neighbors of each tile (see Processing options). If this option is given a value of zero or one, no smoothing will be done. Without smoothing, strong boundaries will probably be created between the values estimated for each tile. It is thus good to smooth the interpolated image so strong discontinuities do not show up in the final Sky values. The smoothing is done through convolution (see Convolution process) with a flat kernel, so the value to this option must be an odd number.`--checksky`Create a multi-extension FITS file showing the steps that were used to estimate the Sky value over the input, see Quantifying signal in a tile. The file will have two extensions for each step (one for the Sky and one for the Sky standard deviation).

You can
try it by running the command in the `tests` directory, open the image
with a FITS viewer and have a look at it to get a sense of how these
statistics relate to the input image/dataset.

Previous: Sky value, Up: Statistics [Contents][Index]

JavaScript license information

GNU Astronomy Utilities 0.5 manual, December 2017.