## GNU Astronomy Utilities

Previous: , Up: Statistics   [Contents][Index]

#### 7.1.5 Invoking Statistics

Statistics will print statistical measures of an input dataset (table column or image). The executable name is aststatistics with the following general template

$aststatistics [OPTION ...] InputImage.fits  One line examples: ## Print some general statistics of input image:$ aststatistics image.fits

## Print some general statistics of column named MAG_F160W:
$aststatistics catalog.fits -h1 --column=MAG_F160W ## Make the histogram of the column named MAG_F160W:$ aststatistics table.fits -cMAG_F160W --histogram

## Find the Sky value on image with a given kernel:
$aststatistics image.fits --sky --kernel=kernel.fits ## Print Sigma-clipped results of records with a MAG_F160W ## column value between 26 and 27:$ aststatistics cat.fits -cMAG_F160W -g26 -l27 --sigmaclip=3,0.2

## Print the median value of all records in column MAG_F160W that
## have a value larger than 3 in column PHOTO_Z:
$aststatistics tab.txt -rPHOTO_Z -g3 -cMAG_F160W --median ## Calculate the median of the third column in the input table, but only ## for rows where the mean of the first and second columns is >5.$ awk '($1+$2)/2 > 5 {print $3}' table.txt | aststatistics --median  Statistics can take its input dataset either from a file (image or table) or the Standard input (see Standard input). If any output file is to be created, the value to the --output option, is used as the base name for the generated files. Without --output, the input name will be used to generate an output name, see Automatic output. The options described below are particular to Statistics, but for general operations, it shares a large collection of options with the other Gnuastro programs, see Common options for the full list. For more on reading from standard input, please see the description of --stdintimeout option in Input/Output options. Options can also be given in configuration files, for more, please see Configuration files. The input dataset may have blank values (see Blank pixels), in this case, all blank pixels are ignored during the calculation. Initially, the full dataset will be read, but it is possible to select a specific range of data elements to use in the analysis of each run. You can either directly specify a minimum and maximum value for the range of data elements to use (with --greaterequal or --lessthan), or specify the range using quantiles (with --qrange). If a range is specified, all pixels outside of it are ignored before any processing. The following set of options are for specifying the input/outputs of Statistics. There are many other input/output options that are common to all Gnuastro programs including Statistics, see Input/Output options for those. -c STR/INT --column=STR/INT The column to use when the input file is a table with more than one column. See Selecting table columns for a full description of how to use this option. For more on how tables are read in Gnuastro, please see Tables. -r STR/INT --refcol=STR/INT The reference column selector when the input file is a table. When a reference column is given, the range options below will be applied to this column and only elements in the input column that have a reference value in the correct range will be used. In practice this option allows you to select a subset of the input column based on values in another (the reference) column. All the statistical calculations will be done on the selected input column, not the reference column. -g FLT --greaterequal=FLT Limit the range of inputs into those with values greater and equal to what is given to this option. None of the values below this value will be used in any of the processing steps below. -l FLT --lessthan=FLT Limit the range of inputs into those with values less-than what is given to this option. None of the values greater or equal to this value will be used in any of the processing steps below. -Q FLT[,FLT] --qrange=FLT[,FLT] Specify the range of usable inputs using the quantile. This option can take one or two quantiles to specify the range. When only one number is input (let’s call it $$Q$$), the range will be those values in the quantile range $$Q$$ to $$1-Q$$. So when only one value is given, it must be less than 0.5. When two values are given, the first is used as the lower quantile range and the second is used as the larger quantile range. The quantile of a given element in a dataset is defined by the fraction of its index to the total number of values in the sorted input array. So the smallest and largest values in the dataset have a quantile of 0.0 and 1.0. The quantile is a very useful non-parametric (making no assumptions about the input) relative measure to specify a range. It can best be understood in terms of the cumulative frequency plot, see Histogram and Cumulative Frequency Plot. The quantile of each horizontal axis value in the cumulative frequency plot is the vertical axis value associate with it. When no operation is requested, Statistics will print some general basic properties of the input dataset on the command-line like the example below (ran on one of the output images of make check128). This default behavior is designed to help give you a general feeling of how the data are distributed and help in narrowing down your analysis. $ aststatistics convolve_spatial_scaled_noised.fits     \
--greaterequal=9500 --lessthan=11000
Statistics (GNU Astronomy Utilities) X.X
-------
Input: convolve_spatial_scaled_noised.fits (hdu: 0)
Range: from (inclusive) 9500, upto (exclusive) 11000.
Unit: Brightness
-------
Number of elements:                      9074
Minimum:                                 9622.35
Maximum:                                 10999.7
Mode:                                    10055.45996
Mode quantile:                           0.4001983908
Median:                                  10093.7
Mean:                                    10143.98257
Standard deviation:                      221.80834
-------
Histogram:
|                   **
|                 ******
|                 *******
|                *********
|              *************
|              **************
|            ******************
|            ********************
|          *************************** *
|        ***************************************** ***
|*  **************************************************************
|-----------------------------------------------------------------


Gnuastro’s Statistics is a very general purpose program, so to be able to easily understand this diversity in its operations (and how to possibly run them together), we’ll divided the operations into two types: those that don’t respect the position of the elements and those that do (by tessellating the input on a tile grid, see Tessellation). The former treat the whole dataset as one and can re-arrange all the elements (for example sort them), but the former do their processing on each tile independently. First, we’ll review the operations that work on the whole dataset.

The group of options below can be used to get single value measurement(s) of the whole dataset. They will print only the requested value as one field in a line/row, like the --mean, --median options. These options can be called any number of times and in any order. The outputs of all such options will be printed on one line following each other (with a space character between them). This feature makes these options very useful in scripts, or to redirect into programs like GNU AWK for higher-level processing. These are some of the most basic measures, Gnuastro is still under heavy development and this list will grow. If you want another statistical parameter, please contact us and we will do out best to add it to this list, see Suggest new feature.

-n
--number

Print the number of all used (non-blank and in range) elements.

--minimum

Print the minimum value of all used elements.

--maximum

Print the maximum value of all used elements.

--sum

Print the sum of all used elements.

-m
--mean

Print the mean (average) of all used elements.

-t
--std

Print the standard deviation of all used elements.

-E
--median

Print the median of all used elements.

-u FLT[,FLT[,...]]
--quantile=FLT[,FLT[,...]]

Print the values at the given quantiles of the input dataset. Any number of quantiles may be given and one number will be printed for each. Values can either be written as a single number or as fractions, but must be between zero and one (inclusive). Hence, in effect --quantile=0.25 --quantile=0.75 is equivalent to --quantile=0.25,3/4, or -u1/4,3/4.

The returned value is one of the elements from the dataset. Taking $$q$$ to be your desired quantile, and $$N$$ to be the total number of used (non-blank and within the given range) elements, the returned value is at the following position in the sorted array: $$round(q\times{}N$$).

--quantfunc=FLT[,FLT[,...]]

Print the quantiles of the given values in the dataset. This option is the inverse of the --quantile and operates similarly except that the acceptable values are within the range of the dataset, not between 0 and 1. Formally it is known as the “Quantile function”.

Since the dataset is not continuous this function will find the nearest element of the dataset and use its position to estimate the quantile function.

-O
--mode

Print the mode of all used elements. The mode is found through the mirror distribution which is fully described in Appendix C of Akhlaghi and Ichikawa 2015. See that section for a full description.

This mode calculation algorithm is non-parametric, so when the dataset is not large enough (larger than about 1000 elements usually), or doesn’t have a clear mode it can fail. In such cases, this option will return a value of nan (for the floating point NaN value).

As described in that paper, the easiest way to assess the quality of this mode calculation method is to use it’s symmetricity (see --modesym below). A better way would be to use the --mirror option to generate the histogram and cumulative frequency tables for any given mirror value (the mode in this case) as a table. If you generate plots like those shown in Figure 21 of that paper, then your mode is accurate.

--modequant

Print the quantile of the mode. You can get the actual mode value from the --mode described above. In many cases, the absolute value of the mode is irrelevant, but its position within the distribution is important. In such cases, this option will become handy.

--modesym

Print the symmetricity of the calculated mode. See the description of --mode for more. This mode algorithm finds the mode based on how symmetric it is, so if the symmetricity returned by this option is too low, the mode is not too accurate. See Appendix C of Akhlaghi and Ichikawa 2015 for a full description. In practice, symmetricity values larger than 0.2 are mostly good.

--modesymvalue

Print the value in the distribution where the mirror and input distributions are no longer symmetric, see --mode and Appendix C of Akhlaghi and Ichikawa 2015 for more.

--sigclip-number

Number of elements after applying $$\sigma$$-clipping (see Sigma clipping). $$\sigma$$-clipping configuration is done with the --sigclipparams option.

--sigclip-median

Median after applying $$\sigma$$-clipping (see Sigma clipping). $$\sigma$$-clipping configuration is done with the --sigclipparams option.

Here is one scenario where this can be useful: assume you have a table and you would like to remove the rows that are outliers (not within the $$\sigma$$-clipping range). Let’s assume your table is called table.fits and you only want to keep the rows that have a value in COLUMN within the $$\sigma$$-clipped range (to $$3\sigma$$, with a tolerance of 0.1). This command will return the $$\sigma$$-clipped median and standard deviation (used to define the range later).

$aststatistics table.fits -cCOLUMN --sclipparams=3,0.1 \ --sigclip-median --sigclip-std  You can then use the --range option of Table (see Table) to select the proper rows. But for that, you need the actual starting and ending values of the range ($$m\pm s\sigma$$; where $$m$$ is the median and $$s$$ is the multiple of sigma to define an outlier). Therefore, the raw outputs of Statistics in the command above aren’t enough. To get the starting and ending values of the non-outlier range (and put a ‘,’ between them, ready to be used in --range), pipe the result into AWK. But in AWK, we’ll also need the multiple of $$\sigma$$, so we’ll define it as a shell variable (s) before calling Statistics (note how $s is used two times now):

$s=3$ aststatistics table.fits -cCOLUMN --sclipparams=$s,0.1 \ --sigclip-median --sigclip-std \ | awk '{s='$s'; printf("%f,%f\n", $1-s*$2, $1+s*$2)}'


To pass it onto Table, we’ll need to keep the printed output from the command above in another shell variable (r), not print it. In Bash, can do this by putting the whole statement within a $(): $ s=3
$r=$(aststatistics table.fits -cCOLUMN --sclipparams=$s,0.1 \ --sigclip-median --sigclip-std \ | awk '{s='$s'; printf("%f,%f\n", $1-s*$2, $1+s*$2)}')
$echo$r      # Just to confirm.


Now you can use Table with the --range option to only print the rows that have a value in COLUMN within the desired range:

$asttable table.fits --range=COLUMN,$r


To save the resulting table (that is clean of outliers) in another file (for example named cleaned.fits, it can also have a .txt suffix), just add --output=cleaned.fits to the command above.

--sigclip-mean

Mean after applying $$\sigma$$-clipping (see Sigma clipping). $$\sigma$$-clipping configuration is done with the --sigclipparams option.

--sigclip-std

Standard deviation after applying $$\sigma$$-clipping (see Sigma clipping). $$\sigma$$-clipping configuration is done with the --sigclipparams option.

The list of options below are for those statistical operations that output more than one value. So while they can be called together in one run, their outputs will be distinct (each one’s output will usually be printed in more than one line).

-A
--asciihist

Print an ASCII histogram of the usable values within the input dataset along with some basic information like the example below (from the UVUDF catalog129). The width and height of the histogram (in units of character widths and heights on your command-line terminal) can be set with the --numasciibins (for the width) and --asciiheight options.

For a full description of the histogram, please see Histogram and Cumulative Frequency Plot. An ASCII plot is certainly very crude and cannot be used in any publication, but it is very useful for getting a general feeling of the input dataset very fast and easily on the command-line without having to take your hands off the keyboard (which is a major distraction!). If you want to try it out, you can write it all in one line and ignore the \ and extra spaces.

$aststatistics uvudf_rafelski_2015.fits.gz --hdu=1 \ --column=MAG_F160W --lessthan=40 \ --asciihist --numasciibins=55 ASCII Histogram: Number: 8593 Y: (linear: 0 to 660) X: (linear: 17.7735 -- 31.4679, in 55 bins) | **** | ***** | ****** | ******** | ********* | *********** | ************** | ***************** | *********************** | ******************************** |*** *************************************************** |-------------------------------------------------------  --asciicfp Print the cumulative frequency plot of the usable elements in the input dataset. Please see descriptions under --asciihist for more, the example below is from the same input table as that example. To better understand the cumulative frequency plot, please see Histogram and Cumulative Frequency Plot. $ aststatistics uvudf_rafelski_2015.fits.gz --hdu=1         \
--column=MAG_F160W --lessthan=40            \
--asciicfp --numasciibins=55
ASCII Cumulative frequency plot:
Y: (linear: 0 to 8593)
X: (linear: 17.7735 -- 31.4679, in 55 bins)
|                                                *******
|                                             **********
|                                            ***********
|                                          *************
|                                         **************
|                                        ***************
|                                      *****************
|                                    *******************
|                                ***********************
|                         ******************************
|*******************************************************
|-------------------------------------------------------

-H
--histogram

Save the histogram of the usable values in the input dataset into a table. The first column is the value at the center of the bin and the second is the number of points in that bin. If the --cumulative option is also called with this option in a run, then the table will have three columns (the third is the cumulative frequency plot). Through the --numbins, --onebinstart, or --manualbinrange, you can modify the first column values and with --normalize and --maxbinone you can modify the second columns. See below for the description of each.

By default (when no --output is specified) a plain text table will be created, see Gnuastro text table format. If a FITS name is specified, you can use the common option --tableformat to have it as a FITS ASCII or FITS binary format, see Common options. This table can then be fed into your favorite plotting tool and get a much more clean and nice histogram than what the raw command-line can offer you (with the --asciihist option).

--histogram2d

Save the 2D histogram of two input columns into an output file, see 2D Histograms. The output will have three columns: the first two are the coordinates of each box’s center in the first and second dimensions/columns. The third will be number of input points that fall within that box.

-C
--cumulative

Save the cumulative frequency plot of the usable values in the input dataset into a table, similar to --histogram.

-s
--sigmaclip

Do $$\sigma$$-clipping on the usable pixels of the input dataset. See Sigma clipping for a full description on $$\sigma$$-clipping and also to better understand this option. The $$\sigma$$-clipping parameters can be set through the --sclipparams option (see below).

--mirror=FLT

Make a histogram and cumulative frequency plot of the mirror distribution for the given dataset when the mirror is located at the value to this option. The mirror distribution is fully described in Appendix C of Akhlaghi and Ichikawa 2015 and currently it is only used to calculate the mode (see --mode).

Just note that the mirror distribution is a discrete distribution like the input, so while you may give any number as the value to this option, the actual mirror value is the closest number in the input dataset to this value. If the two numbers are different, Statistics will warn you of the actual mirror value used.

This option will make a table as output. Depending on your selected name for the output, it will be either a FITS table or a plain text table (which is the default). It contains three columns: the first is the center of the bins, the second is the histogram (with the largest value set to 1) and the third is the normalized cumulative frequency plot of the mirror distribution. The bins will be positioned such that the mode is on the starting interval of one of the bins to make it symmetric around the mirror. With this output file and the input histogram (that you can generate in another run of Statistics, using the --onebinvalue), it is possible to make plots like Figure 21 of Akhlaghi and Ichikawa 2015.

The list of options below allow customization of the histogram and cumulative frequency plots (for the --histogram, --cumulative, --asciihist, and --asciicfp options).

--numbins

The number of bins (rows) to use in the histogram and the cumulative frequency plot tables (outputs of --histogram and --cumulative).

--numasciibins

The number of bins (characters) to use in the ASCII plots when printing the histogram and the cumulative frequency plot (outputs of --asciihist and --asciicfp).

--asciiheight

The number of lines to use when printing the ASCII histogram and cumulative frequency plot on the command-line (outputs of --asciihist and --asciicfp).

-n
--normalize

Normalize the histogram or cumulative frequency plot tables (outputs of --histogram and --cumulative). For a histogram, the sum of all bins will become one and for a cumulative frequency plot the last bin value will be one.

--maxbinone

Divide all the histogram values by the maximum bin value so it becomes one and the rest are similarly scaled. In some situations (for example if you want to plot the histogram and cumulative frequency plot in one plot) this can be very useful.

--onebinstart=FLT

Make sure that one bin starts with the value to this option. In practice, this will shift the bins used to find the histogram and cumulative frequency plot such that one bin’s lower interval becomes this value.

For example when a histogram range includes negative and positive values and zero has a special significance in your analysis, then zero might fall somewhere in one bin. As a result that bin will have counts of positive and negative. By setting --onebinstart=0, you can make sure that one bin will only count negative values in the vicinity of zero and the next bin will only count positive ones in that vicinity.

Note that by default, the first row of the histogram and cumulative frequency plot show the central values of each bin. So in the example above you will not see the 0.000 in the first column, you will see two symmetric values.

If the value is not within the usable input range, this option will be ignored. When it is, this option is the last operation before the bins are finalized, therefore it has a higher priority than options like --manualbinrange.

--manualbinrange

Use the values given to the --greaterequal and --lessthan to define the range of all bin-based calculations like the histogram. This option itself doesn’t take any value, but just tells the program to use the values of those two options instead of the minimum and maximum values of a plot. If any of the two options are not given, then the minimum or maximum will be used respectively. Therefore, if none of them are called calling this option is redundant.

The --onebinstart option has a higher priority than this option. In other words, --onebinstart takes effect after the range has been finalized and the initial bins have been defined, therefore it has the power to (possibly) shift the bins. If you want to manually set the range of the bins and have one bin on a special value, it is thus better to avoid --onebinstart.

--numbins2=INT

Similar to --numbins, but for the second column when a 2D histogram is requested, see --histogram2d.

--greaterequal2=FLT

Similar to --greaterequal, but for the second column when a 2D histogram is requested, see --histogram2d.

--lessthan2=FLT

Similar to --lessthan, but for the second column when a 2D histogram is requested, see --histogram2d.

--onebinstart2=FLT

Similar to --onebinstart, but for the second column when a 2D histogram is requested, see --histogram2d.

All the options described until now were from the first class of operations discussed above: those that treat the whole dataset as one. However, it often happens that the relative position of the dataset elements over the dataset is significant. For example you don’t want one median value for the whole input image, you want to know how the median changes over the image. For such operations, the input has to be tessellated (see Tessellation). Thus this class of options can’t currently be called along with the options above in one run of Statistics.

-t
--ontile

Do the respective single-valued calculation over one tile of the input dataset, not the whole dataset. This option must be called with at least one of the single valued options discussed above (for example --mean or --quantile). The output will be a file in the same format as the input. If the --oneelempertile option is called, then one element/pixel will be used for each tile (see Processing options). Otherwise, the output will have the same size as the input, but each element will have the value corresponding to that tile’s value. If multiple single valued operations are called, then for each operation there will be one extension in the output FITS file.

-R FLT[,FLT[,FLT...]]
--contour=FLT[,FLT[,FLT...]]

Write the contours for the requested levels in a file ending with _contour.txt. It will have three columns: the first two are the coordinates of each point and the third is the level it belongs to (one of the input values). Each disconnected contour region will be separated by a blank line. This is the requested format for adding contours with PGFPlots in LaTeX. If any other format can be useful for your work please let us know so we can add it. If the image has World Coordinate System information, the written coordinates will be in RA and Dec, otherwise, they will be in pixel coordinates.

Note that currently, this is a very crude/simple implementation, please let us know if you find problematic situations so we can fix it.

-y
--sky

Estimate the Sky value on each tile as fully described in Quantifying signal in a tile. As described in that section, several options are necessary to configure the Sky estimation which are listed below. The output file will have two extensions: the first is the Sky value and the second is the Sky standard deviation on each tile. Similar to --ontile, if the --oneelempertile option is called, then one element/pixel will be used for each tile (see Processing options).

The parameters for estimating the sky value can be set with the following options, except for the --sclipparams option (which is also used by the --sigmaclip), the rest are only used for the Sky value estimation.

-k=STR
--kernel=STR

File name of kernel to help in estimating the significance of signal in a tile, see Quantifying signal in a tile.

--khdu=STR

Kernel HDU to help in estimating the significance of signal in a tile, see Quantifying signal in a tile.

--meanmedqdiff=FLT

The maximum acceptable distance between the quantiles of the mean and median, see Quantifying signal in a tile. The initial Sky and its standard deviation estimates are measured on tiles where the quantiles of their mean and median are less distant than the value given to this option. For example --meanmedqdiff=0.01 means that only tiles where the mean’s quantile is between 0.49 and 0.51 (recall that the median’s quantile is 0.5) will be used.

--sclipparams=FLT,FLT

The $$\sigma$$-clipping parameters, see Sigma clipping. This option takes two values which are separated by a comma (,). Each value can either be written as a single number or as a fraction of two numbers (for example 3,1/10). The first value to this option is the multiple of $$\sigma$$ that will be clipped ($$\alpha$$ in that section). The second value is the exit criteria. If it is less than 1, then it is interpreted as tolerance and if it is larger than one it is a specific number. Hence, in the latter case the value must be an integer.

--outliersclip=FLT,FLT

$$\sigma$$-clipping parameters for the outlier rejection of the Sky value (similar to --sclipparams).

Outlier rejection is useful when the dataset contains a large and diffuse (almost flat within each tile) signal. The flatness of the profile will cause it to successfully pass the mean-median quantile difference test, so we’ll need to use the distribution of successful tiles for removing these false positive. For more, see the latter half of Quantifying signal in a tile.

--outliersigma=FLT

Multiple of sigma to define an outlier in the Sky value estimation. If this option is given a value of zero, no outlier rejection will take place. For more see --outliersclip and the latter half of Quantifying signal in a tile.

--smoothwidth=INT

Width of a flat kernel to convolve the interpolated tile values. Tile interpolation is done using the median of the --interpnumngb neighbors of each tile (see Processing options). If this option is given a value of zero or one, no smoothing will be done. Without smoothing, strong boundaries will probably be created between the values estimated for each tile. It is thus good to smooth the interpolated image so strong discontinuities do not show up in the final Sky values. The smoothing is done through convolution (see Convolution process) with a flat kernel, so the value to this option must be an odd number.

--ignoreblankintiles

Don’t set the input’s blank pixels to blank in the tiled outputs (for example Sky and Sky standard deviation extensions of the output). This is only applicable when the tiled output has the same size as the input, in other words, when --oneelempertile isn’t called.

By default, blank values in the input (commonly on the edges which are outside the survey/field area) will be set to blank in the tiled outputs also. But in other scenarios this default behavior is not desired: for example if you have masked something in the input, but want the tiled output under that also.

--checksky

Create a multi-extension FITS file showing the steps that were used to estimate the Sky value over the input, see Quantifying signal in a tile. The file will have two extensions for each step (one for the Sky and one for the Sky standard deviation).

#### Footnotes

##### (128)

You can try it by running the command in the tests directory, open the image with a FITS viewer and have a look at it to get a sense of how these statistics relate to the input image/dataset.

##### (129)

https://asd.gsfc.nasa.gov/UVUDF/uvudf_rafelski_2015.fits.gz

Previous: , Up: Statistics   [Contents][Index]