Next: Sigma clipping, Previous: Statistics, Up: Statistics [Contents][Index]

Histograms and cumulative frequency plots are both used to visually
study the distribution of a dataset. A histogram shows the number of data
points which lie within pre-defined intervals (bins). So on the horizontal
axis we have the bin centers and on the vertical, the number of points that
are in that bin. You can use it to get a general view of the distribution:
Which values have been repeated the most? How close/far are the most
significant bins? Are there more values in the upper part of the range of
the dataset, or in the lower part? Many other important properties of the
dataset can similarly be deduced from a visual inspection of the
histogram. In the Statistics program, the histogram can either be written to
a table to plot with your favorite plotting program^{119}, or
it can be shown with ASCII characters on the command-line, which is very
crude, but good enough for a fast and on-the-go analysis; see the example
in Invoking Statistics.

The width of the bins is the only necessary parameter for a histogram. In the
limiting case that the bin-widths tend to zero (while assuming the number
of points in the dataset tends to infinity), the histogram will tend to
the probability density function of the distribution. When the absolute number
of points in each bin is not relevant to the study (only the shape of the
histogram is important), you can *normalize* a histogram so that, like the
integral of the probability density function, the sum of all its bins will
be one.
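The counting and normalization described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Statistics program's implementation: the function names `histogram` and `normalize`, the equal-width bins spanning the data's own range, and the sample values are all assumptions made for the example.

```python
def histogram(data, numbins):
    """Count how many values fall in each of `numbins` equal-width bins.

    Assumes the data contain at least two distinct values, so the
    bin width is not zero.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / numbins
    counts = [0] * numbins
    for v in data:
        # Clamp the index so the maximum value lands in the last bin.
        i = min(int((v - lo) / width), numbins - 1)
        counts[i] += 1
    # Horizontal axis of the plot: the center of each bin.
    centers = [lo + (i + 0.5) * width for i in range(numbins)]
    return centers, counts


def normalize(counts):
    """Scale the counts so that they sum to one."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical sample values, chosen only for illustration.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
centers, counts = histogram(data, 4)
fractions = normalize(counts)
```

After normalization only the shape of the histogram remains: the total number of input points can no longer be recovered from `fractions`.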

In the cumulative frequency plot of a distribution, the horizontal axis is the sorted data values and the vertical axis is the index of each data point in the sorted distribution. Unlike a histogram, a cumulative frequency plot does not involve intervals or bins. This makes it less prone to any sort of bias or error that a given bin-width would have on the analysis. When a large number of the data points have roughly the same value, the cumulative frequency plot will become steep in that vicinity. This occurs because on the horizontal axis there is little change, while on the vertical axis the indexes constantly increase. Normalizing a cumulative frequency plot means dividing each index (vertical axis) by the total number of data points (or the last value).
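As a minimal sketch of this construction (again only an illustration, with a hypothetical function name and sample values), the plot's two axes can be built by sorting the data and pairing every value with its position. Counting positions from 1 is an assumption made here so that the last index equals the number of points.

```python
def cumulative_frequency(data, normalize=False):
    """Return the two axes of an un-binned cumulative frequency plot."""
    values = sorted(data)              # horizontal axis
    n = len(values)
    indexes = list(range(1, n + 1))    # vertical axis (1-based, an assumption)
    if normalize:
        # Divide each index by the total number of points.
        indexes = [i / n for i in indexes]
    return values, indexes


# Hypothetical sample: four of the seven points share the value 2.
data = [9, 2, 5, 2, 1, 2, 2]
values, indexes = cumulative_frequency(data)
_, fractions = cumulative_frequency(data, normalize=True)
```

In this sample the index climbs from 2 to 5 while the value stays at 2, which is the steep region described in the text.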

Unlike the histogram, which has a limited number of bins, the
cumulative frequency plot should ideally have one point for every data
element. Even in small datasets (for example a \(200\times200\) image)
this will result in an unreasonably large number of points to plot (40000)!
For practical reasons, it is therefore common to only store its value at
a certain number of points (intervals) in the input range rather than for
the whole dataset, so you should specify the number of bins you want when
asking for a cumulative frequency plot. In Gnuastro (and thus the
Statistics program), the number reported for each bin is the total number
of data points up to the upper limit of that bin. You can see an
example histogram and cumulative frequency plot of a single dataset under
the `--asciihist` and `--asciicfp` options of Invoking Statistics.
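The binned form can be sketched as follows: for each bin, count how many sorted values lie below the bin's upper limit. This is a rough illustration under stated assumptions (equal-width bins over the data's own range, the largest value counted in the last bin, a hypothetical function name `binned_cfp`), not a description of Gnuastro's internals.

```python
from bisect import bisect_left


def binned_cfp(data, numbins):
    """Cumulative count of points up to the upper limit of each bin."""
    values = sorted(data)
    lo, hi = values[0], values[-1]
    width = (hi - lo) / numbins
    upper_edges = [lo + (i + 1) * width for i in range(numbins)]
    # bisect_left gives the number of values strictly below each edge.
    cumulative = [bisect_left(values, edge) for edge in upper_edges]
    # The range came from the data itself, so the largest value is
    # counted in the last bin.
    cumulative[-1] = len(values)
    return upper_edges, cumulative


# Hypothetical sample values, chosen only for illustration.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
upper_edges, cumulative = binned_cfp(data, 4)
```

Only `numbins` values are stored, however large the dataset is, at the cost of losing detail inside each bin.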

In summary, both the histogram and the cumulative frequency plot in
Statistics work with bins. Within each bin/interval, the lower value
is considered to be within the bin (it is inclusive), but the upper value
is not (it is exclusive). Formally, an interval/bin between a and b is
represented by [a, b). When the over-all range of the dataset is specified
(with the `--greaterequal`, `--lessthan`, or `--qrange` options), the
acceptable values of the dataset are also defined in a similar
inclusive-exclusive manner. But when the range is
determined from the actual dataset (none of these options is called), the
last element in the dataset is included in the last bin’s count.
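The difference between the two cases can be illustrated with a small Python sketch. The function `count_in_bins` and its `include_max` parameter are hypothetical, and reading "last element" as the largest value in the dataset is an assumption. The code only mimics the stated [a, b) rule and is not how Statistics is implemented.

```python
def count_in_bins(data, lo, hi, numbins, include_max=False):
    """Count values in `numbins` equal-width bins over [lo, hi).

    With `include_max=True`, a value equal to `hi` is counted in the
    last bin, as when the range is taken from the data itself.
    """
    width = (hi - lo) / numbins
    counts = [0] * numbins
    for v in data:
        if include_max and v == hi:
            counts[-1] += 1
        elif lo <= v < hi:  # lower limit inclusive, upper limit exclusive
            i = min(int((v - lo) / width), numbins - 1)
            counts[i] += 1
    return counts


# Hypothetical sample values, chosen only for illustration.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Range given explicitly as [1, 9): the value 9 is excluded.
explicit = count_in_bins(data, 1, 9, 4)

# Range taken from the data: the maximum is kept in the last bin.
from_data = count_in_bins(data, min(data), max(data), 4, include_max=True)
```

With the explicit range one point is dropped, so the counts sum to 9 rather than 10.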

^{119} We recommend PGFPlots, which generates your plots directly within TeX (the same tool that generates your document).

GNU Astronomy Utilities 0.7 manual, August 2018.