Sigma clipping (GNU Astronomy Utilities)

Next: MAD clipping, Previous: Building inputs and analysis without clipping, Up: Clipping outliers [Contents][Index]

2.10.2 Sigma clipping ¶

Let’s assume that you have pure noise (centered on zero) with a clear Gaussian distribution, or see Photon counting noise. Now let’s assume you add very bright objects (signal) on the image which have a very sharp boundary. By a sharp boundary, we mean that there is a clear cutoff (from the noise) at the pixels the objects finish. In other words, at their boundaries, the objects do not fade away into the noise.

In optical astronomical imaging, cosmic rays (when they collide at a near normal incidence angle) are a very example of such outliers. The tracks they leave behind in the image are perfectly immune to the blurring caused by the atmosphere on images of stars or galaxies and they have a very high signal-to-noise ratio. They are also very energetic and so their borders are usually clearly separated from the surrounding noise. See Figure 15 in Akhlaghi and Ichikawa, 2015.

In such a case, when you plot the histogram (see Histogram and Cumulative Frequency Plot) of the distribution, the pixels relating to those objects will be clearly separate from pixels that belong to parts of the image that did not have any signal (were just noise). In the cumulative frequency plot, after a steady rise (due to the noise), you would observe a long flat region were for a certain range of the dynamic range (horizontal axis), there is no increase in the index (vertical axis).

In the previous section (Building inputs and analysis without clipping) we created one such dataset (in-9.fits). With the command below, let’s have a look at its histogram and cumulative frequency plot (in simple ASCII format; we are decreasing the default number of bins with --numasciibins to show them easily within the width of the print version of this manual; feel free to change this).

$ aststatistics build/in-9.fits --asciihist --asciicfp \
                --numasciibins=65

ASCII Histogram:
Number: 40401
Y: (linear: 0 to 4191)
X: (linear: -31.9714 -- 150.323, in 65 bins)
 |              **
 |             ****
 |            ******
 |            ******
 |           ********
 |           ********
 |          **********
 |         ************
 |        **************
 |      ******************                          ******
 |*******************************         *************************
 |-----------------------------------------------------------------


ASCII Cumulative frequency plot:
Y: (linear: 0 to 40401)
X: (linear: -31.9714 -- 150.323, in 65 bins)
 |                                                  ***************
 |                   **********************************************
 |                  ***********************************************
 |                *************************************************
 |               **************************************************
 |              ***************************************************
 |             ****************************************************
 |            *****************************************************
 |           ******************************************************
 |         ********************************************************
 |*****************************************************************
 |-----------------------------------------------------------------

We see a clear bimodal distribution in the histogram. Such outliers can significantly bias any single measurement over the whole dataset. For example let’s compare the median, mean and standard deviation of the image above with 1.fits:

$ aststatistics build/in-1.fits --median --mean --std
9.90529778313248e+00 9.96143102101206e+00 1.00137568561776e+01

$ aststatistics build/in-9.fits --median --mean --std
1.09305819367634e+01 1.74470443173776e+01 2.88895986970341e+01

The effect of the outliers is obvious in all three measures: the median has become 1.10 times larger, the mean 1.75 times and the standard deviation about 2.88 times! The differing effect of outliers in different statistics was already discussed in Building inputs and analysis without clipping; also see Quantifying signal in a tile.

$\sigma$-clipping is one commonly used way to remove/clip the effect of such very strong outliers in measures like the above (although not the most robust, continue reading to the end of this tutorial in the next sections). $\sigma$-clipping is defined as the very simple iteration below. In each iteration, the number of input values used might decrease. When the outliers are as strong as above, the outliers will be removed through this iteration.

Calculate the standard deviation ($\sigma$) and median ($m$) of a distribution. The median is used because, as shown above, the mean is too significantly affected by the presence of outliers.
Remove all points that are smaller or larger than $m\pm\alpha\sigma$.
Go back to step 1, unless the selected exit criteria is reached. There are commonly two types of exit criteria (to stop the $\sigma$-clipping iteration). Within Gnuastro’s programs that use sigma-clipping, the exit criteria is the second value to the --sclipparams option (the first value is the $\alpha$ above):
- When a certain number of iterations has taken place (exit criteria is an integer, larger than 1). For example --sclipparams=5,3 means that the $5\sigma$ clipping will stop after 3 clips.
- When the new measured standard deviation is within a certain tolerance level of the previous iteration (exit criteria is floating point and less than 1.0). The tolerance level is defined by:
  $$\sigma_{old}-\sigma_{new} \over \sigma_{new}$$
  
  In each clipping, the dispersion in the distribution is either less or equal. So $\sigma_{old}\geq\sigma_{new}$. For example --sclipparams=5,0.2 means that the $5\sigma$ clipping will stop the old and new standard deviations are equal within a factor of $0.2$.

Let’s see the algorithm in practice with the --sigmaclip option of Gnuastro’s Statistics program (using the default configuration of $3\sigma$ clipping and tolerance of 0.1):

$ aststatistics build/in-9.fits --sigmaclip
Statistics (GNU Astronomy Utilities) 0.24
-------
Input: build/in-9.fits (hdu: 1)
-------
3-sigma clipping steps until relative change in STD is less than 0.1:

round number     median       STD
1     40401      1.09306e+01  2.88896e+01
2     37660      1.00306e+01  1.07153e+01
3     37539      1.00080e+01  9.93741e+00
-------
Statistics (after clipping):
  Number of input elements:                40401
  Number of clips:                         2
  Final number of elements:                37539
  Median:                                  1.000803e+01
  Mean:                                    1.001822e+01
  Standard deviation:                      9.937410e+00
  Median Absolute Deviation:               6.772760e+00

After the basic information about the input and settings, the Statistics program has printed the information for each round (iteration) of clipping. Initially, there was 40401 elements (the image is $201\times201$ pixels). After the first round of clipping, only 37660 elements remained and because the difference in standard deviation was larger than the tolerance level, a third clipping was one. But the change in standard deviation after the third clip (in relation to the second) was smaller than the tolerance level, so the exit criteria was activated and the clipping finished with 37539 elements. In the end, we see that the final median, mean and standard deviation are very similar to the data without any outlier (build/1.fits in the example above). Therefore, through clipping we were able to remove the second “outlier” distribution from the bimodal histogram above (because it was so nicely separated from the main/noise).

The example above provided a single statistic from a single dataset. Other scenarios where sigma-clipping becomes necessary are coadding and collapsing (that was the main goal of the script in Building inputs and analysis without clipping). To generate $\sigma$-clipped coadds and collapsed tables, you just need to change the values of the three variables of the script (shown below). After making this change in your favorite text editor, have a look at the outputs. By the way, if you have still not read (and understood) the commands in that script, this is a good time to do it so the steps below do not appear as a black box to you (for more on writing shell scripts, see Writing scripts to automate the steps).

$ grep ^clip_ script.sh
clip_operator=sigclip    # These variables will be used more
clip_multiple=3          # effectively with the clipping
clip_tolerance=0.1       # operators of the next sections.

$ bash ./script.sh

$ astscript-fits-view build/coadd-std.fits \
                      build/coadd-sigclip-std.fits \
                      build/coadd-*mean.fits \
                      build/coadd-*median.fits \
                      --ds9extra="-tile grid layout 2 3 -scale minmax"

You will see 6 images arranged in two columns: the first column is the normal coadd (without $\sigma$-clipping) and the second column is the $\sigma$-clipped coadd of the same statistic (first row: standard deviation, second row: mean, third row: median).

It is clear that the $\sigma$-clipped (right column in DS9) results have improved in all three measures (compared to the left column). This was achieved by clipping/removing outliers. To see how many input images were used in each pixel of the clipped coadd, you should look into the second HDU of any clipping output which shows the number of inputs that were used for each pixel:

$ astscript-fits-view build/coadd-sigclip-median.fits \
                      --ds9extra="-scale minmax"

In the “Cube” window of opened DS9, click on the “Next” button. The pixels in this image only have two values: 8 or 9. Over the footprint of the circle, most pixels have a value of 8: only 8 inputs were used for these (one of the inputs was clipped out). In the other regions of the image, you see that the pixels almost consistently have a value of 9 (except for some noisy pixels here and there).

It is the “holes” (with value 9) within the footprint of the circle that keep the circle visible in the final coadd of the output (as we saw previously in the 2-column DS9 command before). Spoiler alert: in a later section of this tutorial (Contiguous outliers) you will see how we fix this problem. But please be patient and continue reading and running the commands for now.

So far, $\sigma$-clipping seems to have preformed nicely. However, there are important caveats to $\sigma$-clipping that are listed in the box below and further elaborated (with examples) afterwards.

Caveats of $\sigma$-clipping: continue this section to visually see the effect of both these caveats:

The standard deviation is itself heavily influenced by the presence of outliers (as we have shown above). Therefore a sufficiently small number of outliers can cause an over-estimation of the standard deviation. This can be strong enough to keep those from getting clipped!
When the outliers do not constitute a clearly distinct distribution like the example here, $\sigma$-clipping will not be able to separate them (see the bimodal histogram above for situations that $\sigma$-clipping is useful).

To demonstrate the caveats above, let’s decrease the brightness (total sum of values) in the circle by decreasing the value of the profsum variable in the script:

$ grep ^profsum script.sh
profsum=1e5

$ bash ./script.sh

First, let’s have a look at the new circle in noise with the first command below. With the second command, let’s view its pixel value histogram (recall that previously, the circle had a clearly separate distribution):

$ astscript-fits-view build/in-9.fits

$ aststatistics build/in-9.fits --asciihist --numasciibins=65
ASCII Histogram:
Number: 40401
Y: (linear: 0 to 2654)
X: (linear: -31.9714 -- 79.4266, in 65 bins)
 |                        **
 |                      *****
 |                    *********
 |                    **********
 |                  *************
 |                  **************
 |                *****************
 |               *******************
 |             ***********************
 |          ****************************************
 |*****************************************************************
 |-----------------------------------------------------------------

We see that even tough the circle is still clearly visible in the noise in DS9, we don’t have a bimodal histogram any more; the circle’s pixels have blended into the noise, and just caused a skewness in the (otherwise symmetric) noise distribution. Let’s try running the --sigmaclip option as above:

$ aststatistics build/in-9.fits --sigmaclip
Statistics (GNU Astronomy Utilities) 0.24
-------
Input: build/in-9.fits (hdu: 1)
-------
3-sigma clipping steps until relative change in STD is less than 0.1:

round number     median       STD
1     40401      1.09295e+01  1.34784e+01
2     39618      1.06762e+01  1.19852e+01
3     39126      1.05265e+01  1.12983e+01
-------
Statistics (after clipping):
  Number of input elements:                40401
  Number of clips:                         2
  Final number of elements:                39126
  Median:                                  1.052652e+01
  Mean:                                    1.114819e+01
  Standard deviation:                      1.129831e+01
  Median Absolute Deviation:               7.106166e+00

We see that the median, mean and standard deviation are over estimated (each worse than the time when the circle was brighter!). Let’s look at the $\sigma$-clipping coadd:

$ astscript-fits-view build/coadd-std.fits \
                      build/coadd-sigclip-std.fits \
                      build/coadd-*mean.fits \
                      build/coadd-*median.fits \
                      --ds9extra="-tile grid layout 2 3 -scale minmax"

Compared to the previous run (where the outlying circle was brighter), we see that $\sigma$-clipping is now less successful in removing the outlying circle from the coadds; or in the single value measurements. To see the reason, we can have a look at the numbers image (note that here, we are using -h2 to only see the numbers image)

$ astscript-fits-view build/coadd-sigclip-median.fits -h2 \
                      --ds9extra="-scale minmax"

Unlike before (where the density of pixels with 8 images was very high over the circle’s footprint), the circle is barely visible in the numbers image! There is only a very weak clustering of pixels with a value of 8 over the circle’s footprint. This has happened because the outliers have biased the standard deviation itself to a level that includes them with this multiple of the standard deviation.

To gauge if $\sigma$-clipping will be useful for your dataset, you should inspect the bimodality of its histogram like we did above. But you can’t do this manually in every case (as in the coadding which involved more than forty thousand separate $\sigma$-clippings: one for every output)! Clipping outliers should be based on a different (from standard deviation) measure of scatter/dispersion, one that is more robust (less affected by outliers). Therefore, in Gnuastro we also have median absolute deviation (MAD) clipping which is described in the next section (MAD clipping).