GNU Astronomy Utilities: Histogram and Cumulative Frequency Plot

7.1.1 Histogram and Cumulative Frequency Plot

Histograms and cumulative frequency plots are both used to study the distribution of data. A histogram is generally easier to understand for the untrained eye, while a cumulative frequency plot is more accurate but takes some experience to interpret.

A histogram shows the number of data points that lie within pre-defined intervals (bins). It is used to get a general view of the distribution and its shape. The width of the bins is one of the most important parameters of a histogram. In the limiting case where the bin widths tend to zero (and assuming there is data in each bin), the normalized histogram would trace the shape of the probability density function of the distribution. Normalizing a histogram means dividing the number of data points in each bin by the total number of data points.
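As a minimal sketch of the counting and normalization described above (this is not the program's own implementation; the function names `histogram` and `normalize` and the choice of equal-width bins spanning the minimum to the maximum are assumptions made for illustration):

```python
def histogram(values, num_bins):
    """Count how many values fall in each of num_bins equal-width bins.

    Assumption: the bins span [min(values), max(values)].
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        if width == 0:
            i = 0  # all values are identical: put everything in bin 0
        else:
            # Clamp so the maximum value lands in the last bin.
            i = min(int((v - lo) / width), num_bins - 1)
        counts[i] += 1
    return counts


def normalize(counts):
    """Divide each bin count by the total number of data points."""
    total = sum(counts)
    return [c / total for c in counts]


data = [1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 4.0, 5.0]
counts = histogram(data, 4)    # [2, 3, 1, 2]
fractions = normalize(counts)  # [0.25, 0.375, 0.125, 0.25]
```

Re-running `histogram` on the same data with a different `num_bins` changes the apparent shape, which is why the bin width matters so much.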

In the cumulative frequency plot of a distribution, the x axis is the sorted data values and the y axis is the index of each data point in the sorted distribution. Unlike a histogram, a cumulative frequency plot does not involve intervals or bins. This makes it less prone to the bias or error that a particular choice of bin width can introduce into the analysis. When a large number of data points have roughly the same value, the cumulative frequency plot becomes steep in that vicinity. This occurs because there is little change on the x axis (data values) while the indices on the y axis keep increasing. Normalizing a cumulative frequency plot means dividing each index (y axis) by the total number of data points.
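The construction above can be sketched in a few lines. This is only an illustration: the function name `cumulative_frequency` is hypothetical, and counting indices from 1 (so that the normalized curve ends at exactly 1) is an assumption, since the text does not say where the index starts.

```python
def cumulative_frequency(values, normalized=False):
    """Return (xs, ys): sorted data values and their index in the sorted list.

    Assumption: indices are counted from 1, so the last point has
    y = len(values), or y = 1.0 when normalized.
    """
    xs = sorted(values)
    n = len(xs)
    if normalized:
        ys = [i / n for i in range(1, n + 1)]
    else:
        ys = list(range(1, n + 1))
    return xs, ys


data = [3.0, 1.0, 2.0, 2.0, 2.1, 9.0]
xs, ys = cumulative_frequency(data)
# xs -> [1.0, 2.0, 2.0, 2.1, 3.0, 9.0]
# ys -> [1, 2, 3, 4, 5, 6]
```

Three of the six values lie between 2.0 and 2.1, so y rises from 2 to 4 while x barely moves: this is the steep region described above.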

Unlike the histogram, which has a limited number of bins, the cumulative frequency plot should ideally have one point for every data point. Even for small images (for example, \(200\times200\) pixels) this results in an unreasonably large number of points to plot (40000)! So when the cumulative frequency plot of an image is stored in a text file, it is best to store its value at only a certain number of points (intervals) rather than for the whole data set. The number of points to use for the final plot can be specified with the --cfpnum option.
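To illustrate why a reduced number of points is enough, the sketch below keeps only a fixed number of evenly spaced points from the full curve. It is not necessarily how the program chooses its points for --cfpnum: the function name `sample_cfp`, the even spacing in index, and the 1-based index are all assumptions made for this example.

```python
def sample_cfp(values, num_points):
    """Return num_points (value, index) pairs from the cumulative frequency plot.

    Assumptions: points are evenly spaced in sorted index, the first and
    last data points are always kept, and indices are counted from 1.
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    xs = sorted(values)
    n = len(xs)
    if num_points >= n:
        picked = range(n)  # fewer data points than requested: keep all
    else:
        # Integer arithmetic gives evenly spaced, strictly increasing indices.
        picked = [(k * (n - 1)) // (num_points - 1) for k in range(num_points)]
    return [(xs[i], i + 1) for i in picked]


# A stand-in for a 200 x 200 image: 40000 distinct values in scrambled order.
pixels = [(i * 7919) % 40000 for i in range(200 * 200)]
points = sample_cfp(pixels, 100)
# len(points) -> 100, instead of 40000 rows in the output file
```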

Note that each interval's lower value is considered to be part of the interval, but its upper value is not. Formally, an interval between a and b is represented by [a, b). This is true for all the intervals except the last one, which is closed: [a, b].
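The interval rule can be written out explicitly as follows. This is a sketch for illustration only; `in_interval` and `bin_index` are hypothetical names, and `edges` is assumed to be a sorted list of interval boundaries.

```python
def in_interval(value, a, b, is_last):
    """Half-open [a, b) for every interval except the last, which is [a, b]."""
    if is_last:
        return a <= value <= b
    return a <= value < b


def bin_index(value, edges):
    """Return the index of the interval containing value, or None if outside."""
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        if in_interval(value, edges[i], edges[i + 1], i == last):
            return i
    return None


edges = [0, 1, 2, 3]  # three intervals: [0, 1), [1, 2), [2, 3]
bin_index(1, edges)   # 1: the shared edge belongs to the upper interval
bin_index(3, edges)   # 2: the last interval includes its upper value
```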

GNU Astronomy Utilities manual, May 2016.