Sigma clipping (GNU Astronomy Utilities)

Next: MAD clipping, Previous: Building inputs and analysis without clipping, Up: Clipping outliers [Contents][Index]

2.10.2 Sigma clipping ¶

Let’s assume that you have pure noise (centered on zero) with a clear Gaussian distribution, or see Photon counting noise. Now let’s assume you add very bright objects (signal) on the image which have a very sharp boundary. By a sharp boundary, we mean that there is a clear cutoff (from the noise) at the pixels the objects finish. In other words, at their boundaries, the objects do not fade away into the noise.

In optical astronomical imaging, cosmic rays (when they collide at a near normal incidence angle) are a very good example of such outliers. The tracks they leave behind in the image are perfectly immune to the blurring caused by the atmosphere on images of stars or galaxies and they have a very high signal-to-noise ratio. They are also very energetic and so their borders are usually clearly separated from the surrounding noise. See Figure 15 in Akhlaghi and Ichikawa, 2015.

In such a case, when you plot the histogram (see Histogram and Cumulative Frequency Plot) of the distribution, the pixels relating to those objects will be clearly separate from pixels that belong to parts of the image that did not have any signal (were just noise). In the cumulative frequency plot, after a steady rise (due to the noise), you would observe a long flat region were for a certain range of data (horizontal axis), there is no increase in the index (vertical axis).

In the previous section (Building inputs and analysis without clipping) we created one such dataset (9.fits). With the command below, let’s have a look at its histogram and cumulative frequency plot (in simple ASCII format; we are decreasing the default number of bins with --numasciibins to show them easily within the width of the print version of this manual; feel free to change this).

$ aststatistics build/in-9.fits --asciihist --asciicfp \
                --numasciibins=65

ASCII Histogram:
Number: 40401
Y: (linear: 0 to 4191)
X: (linear: -31.9714 -- 150.323, in 65 bins)
 |              **
 |             ****
 |            ******
 |            ******
 |           ********
 |           ********
 |          **********
 |         ************
 |        **************
 |      ******************                          ******
 |*******************************         *************************
 |-----------------------------------------------------------------


ASCII Cumulative frequency plot:
Y: (linear: 0 to 40401)
X: (linear: -31.9714 -- 150.323, in 65 bins)
 |                                                  ***************
 |                   **********************************************
 |                  ***********************************************
 |                *************************************************
 |               **************************************************
 |              ***************************************************
 |             ****************************************************
 |            *****************************************************
 |           ******************************************************
 |         ********************************************************
 |*****************************************************************
 |-----------------------------------------------------------------

Outliers like the example above can significantly bias the measurement of the background noise statistics. For example let’s compare the median, mean and standard deviation of the image above with 1.fits:

$ aststatistics build/in-1.fits --median --mean --std
9.90529778313248e+00 9.96143102101206e+00 1.00137568561776e+01

$ aststatistics build/in-9.fits --median --mean --std
1.09305819367634e+01 1.74470443173776e+01 2.88895986970341e+01

The effect of the outliers is obvious in all three measures: the median has become 1.10 times larger, the mean 1.75 times and the standard deviation about 2.88 times! The differing effect of outliers in different statistics was already discussed in Building inputs and analysis without clipping; also see Quantifying signal in a tile.

$\sigma$-clipping is one way to remove/clip the effect of such very strong outliers in measures like the above. $\sigma$-clipping is defined as the very simple iteration below. In each iteration, the range of input data might decrease. When the outliers are as strong as above, the outliers will be removed through this iteration.

Calculate the standard deviation ($\sigma$) and median ($m$) of a distribution. The median is used because, as shown above, the mean is too significantly affected by the presence of outliers.
Remove all points that are smaller or larger than $m\pm\alpha\sigma$.
Go back to step 1, unless the selected exit criteria is reached. There are commonly two types of exit criteria (to stop the $\sigma$-clipping iteration). Within Gnuastro’s programs that use sigma-clipping, the exit criteria is the second value to the --sclipparams option (the first value is the $\alpha$ above):
- When a certain number of iterations has taken place (exit criteria is an integer, larger than 1).
- When the new measured standard deviation is within a certain tolerance level of the previous iteration (exit criteria is floating point and less than 1.0). The tolerance level is defined by:
  $$\sigma_{old}-\sigma_{new} \over \sigma_{new}$$
  
  In each clipping, the dispersion in the distribution is either less or equal. So $\sigma_{old}\geq\sigma_{new}$.

Let’s see the algorithm in practice with the --sigmaclip option of Gnuastro’s Statistics program (using the default configuration of $3\sigma$ clipping and tolerance of 0.1):

$ aststatistics build/in-9.fits --sigmaclip
Statistics (GNU Astronomy Utilities) 0.22
-------
Input: build/in-9.fits (hdu: 1)
-------
3-sigma clipping steps until relative change in STD is less than 0.1:

round number     median       STD
1     40401      1.09306e+01  2.88896e+01
2     37660      1.00306e+01  1.07153e+01
3     37539      1.00080e+01  9.93741e+00
-------
Statistics (after clipping):
  Number of input elements:                40401
  Number of clips:                         2
  Final number of elements:                37539
  Median:                                  1.000803e+01
  Mean:                                    1.001822e+01
  Standard deviation:                      9.937410e+00
  Median Absolute Deviation:               6.772760e+00

After the basic information about the input and settings, the Statistics program has printed the information for each round (iteration) of clipping. Initially, there was 40401 elements (the image is $201\times201$ pixels). After the first round of clipping, only 37660 elements remained and because the difference in standard deviation was larger than the tolerance level, a third clipping was one. But the change in standard deviation after the third clip was smaller than the tolerance level, so the exit criteria was activated and the clipping finished with 37539 elements. In the end, we see that the final median, mean and standard deviation are very similar to the data without any outlier (build/1.fits in the example above).

The example above provided a single statistic from a single dataset. Other scenarios where sigma-clipping becomes necessary are stacking and collapsing (that was the main goal of the script in Building inputs and analysis without clipping). To generate $\sigma$-clipped stacks and collapsed tables, you just need to change the values of the three variables of the script (shown below). After making this change in your favorite text editor, have a look at the outputs:

$ grep ^clip_ script.sh
clip_operator=sigclip    # These variables will be used more
clip_multiple=3          # effectively with the clipping
clip_tolerance=0.1       # operators of the next sections.

$ bash ./script.sh

$ astscript-fits-view build/stack-std.fits \
                      build/stack-sigclip-std.fits \
                      build/stack-*mean.fits \
                      build/stack-*median.fits \
                      build/stack-*number.fits \
                      --ds9extra="-tile grid layout 2 4 -scale minmax"

It is clear that the $\sigma$-clipped results have significantly improved in all four measures (images in the right column in DS9). The reason is clear in the stack-sigclip-number.fits image (which show how many images were used for each pixel): almost all of the outlying circle has been accounted for (the pixel values are 8, not 9, showing 8 images went into those). It is the leaked holes in the stack-sigclip-number.fits image (with value of 9) that keep the circle in the final stack of the other measures (at various levels). See Filled re-clipping for stacking operators that can account for this.

So far, $\sigma$-clipping has preformed nicely. However, there are important caveats to $\sigma$-clipping that are listed in the box below and further elaborated (with examples) afterwards.

Caveats of $\sigma$-clipping: There are some important caveats to $\sigma$-clipping:

The standard deviation is itself heavily influenced by the presence of outliers. Therefore a sufficiently small number of outliers can expand the standard deviation such that they stay within the boundaries.
When the outliers do not constitute a clearly distinct distribution like the example here, sigma-clipping will not be able to separate them like here.

To demonstrate how weaker outliers will not be clipped in sigma clipping, let’s decrease the total sum of values in the outlying circle, then re-run the script:

$ grep ^profsum script.sh
profsum=1e5

$ bash ./script.sh

Let’s have a look at the new outlying circle with the first command below. With the second command, let’s view its pixel value histogram (recall that previously, the circle had a clearly separate distribution):

$ astscript-fits-view build/in-9.fits

$ aststatistics build/in-9.fits --asciihist --numasciibins=65
ASCII Histogram:
Number: 40401
Y: (linear: 0 to 2654)
X: (linear: -31.9714 -- 79.4266, in 65 bins)
 |                        **
 |                      *****
 |                    *********
 |                    **********
 |                  *************
 |                  **************
 |                *****************
 |               *******************
 |             ***********************
 |          ****************************************
 |*****************************************************************
 |-----------------------------------------------------------------

We see that even tough the circle is still clearly visible in the noise, the histogram is not longer separate; it has blended into the noise, and just caused a skewness in the otherwise symmetric noise distribution. Let’s try running the --sigmaclip option as above:

$ aststatistics build/in-9.fits --sigmaclip
Statistics (GNU Astronomy Utilities) 0.22
-------
Input: build/in-9.fits (hdu: 1)
-------
3-sigma clipping steps until relative change in STD is less than 0.1:

round number     median       STD
1     40401      1.09295e+01  1.34784e+01
2     39618      1.06762e+01  1.19852e+01
3     39126      1.05265e+01  1.12983e+01
-------
Statistics (after clipping):
  Number of input elements:                40401
  Number of clips:                         2
  Final number of elements:                39126
  Median:                                  1.052652e+01
  Mean:                                    1.114819e+01
  Standard deviation:                      1.129831e+01
  Median Absolute Deviation:               7.106166e+00

We see that the median, mean and standard deviation are over estimated (each worse than the previous!). Let’s see how the $\sigma$-clipping stacking on this outlying flat circle:

$ astscript-fits-view build/stack-std.fits \
                      build/stack-sigclip-std.fits \
                      build/stack-*mean.fits \
                      build/stack-*median.fits \
                      build/stack-*number.fits \
                      --ds9extra="-tile grid layout 2 4 -scale minmax"

Compared to the previous run (where the outlying circle was brighter), we see that $\sigma$-clipping is now less successful in removing the outlying circle from the stacks; or in the single value measurements. This is particularly visible in the stack-sigclip-number.fits image: the circle barely visible any more: there is only a very weak clustering of pixels with a value of 8 over the circle’s pixels. This has happened because the outliers have biased the standard deviation itself to a level that includes them with this multiple of the standard deviation.

To gauge if $\sigma$-clipping will be useful for your dataset, look at the histogram (see Histogram and Cumulative Frequency Plot). The ASCII histogram that is printed on the command-line with --asciihist (like above) is good enough in most cases. But you can’t do this manually in every case (as in the stacking which involved more than forty thousand pixels)! Clipping outliers should be based on a measure of scatter that is less affected by outliers! Therefore, in Gnuastro we also have median absolute deviation (MAD) clipping which is described in the next section (MAD clipping).