As described in Clipping outliers, the goal of this tutorial is to demonstrate the effects of outliers and show how to “clip” them from basic statistics measurements. This is best done on an actual dataset (rather than pure theory). In this section we will build nine noisy images with the script below, such that one of the images has a circle in the middle. We will then stack the 9 images into one final image based on different statistical measurements: the mean, median, standard deviation (STD), median absolute deviation (MAD) and number of inputs used in each pixel. We will then analyze the resulting stacks to demonstrate the problem with outliers.
Put the script below into a plain-text file (assuming it is called script.sh), and run it with bash ./script.sh
.
For more on writing and good practices in shell scripts, see Writing scripts to automate the steps.
The last command of the script above calls DS9 to visualize the five output stacked images mentioned above.
# Constants list="" sigma=10 number=9 radius=30 width=201 bdir=build profsum=3e5 background=10 random_seed=1699270427 # Clipping parameters (will be set later when we start clipping). # clip_multiple: 3 for sigma; 4.5 for MAD # clip_tolerance: 0.1 for sigma; 0.01 for MAD clip_operator="" clip_multiple="" clip_tolerance="" # Stop if there is any error. set -e # If the build directory does not exist, build it. if ! [ -d $bdir ]; then mkdir $bdir; fi # The final image (with largest number) will contain the outlier: # we'll put a flat circle in the center of the image as the outlier # structure. outlier=$bdir/in-$number.fits nn=$bdir/$number-no-noise.fits export GSL_RNG_SEED=$random_seed center=$(echo $width | awk '{print int($1/2)+1}') echo "1 $center $center 5 $radius 0 0 1 $profsum 1" \ | astmkprof --mode=img --mergedsize=$width,$width \ --oversample=1 --output=$nn --mcolissum astarithmetic $nn $background + $sigma mknoise-sigma \ --envseed -o$outlier # Build pure noise and add elements to the list of images to stack. list=$outlier numnoise=$(echo $number | awk '{print $1-1}') for i in $(seq 1 $numnoise); do img="$bdir/in-$i.fits" if ! [ -f $img ]; then export GSL_RNG_SEED=$(echo $random_seed | awk '{print $1+'$i'}') astarithmetic $width $width 2 makenew float32 $background + \ $sigma mknoise-sigma --envseed --output=$img fi list="$list $img" done # Stack the images. for op in mean median std mad; do if [ x"$clip_operator" = x ]; then # No clipping. out=$bdir/stack-$op.fits astarithmetic $list $number $op -g1 --output=$out else # With clipping. operator=$clip_operator-$op out=$bdir/stack-$operator.fits astarithmetic $list $number $clip_multiple $clip_tolerance \ $operator -g1 --writeall --output=$out fi done # Collapse the first and last image along the 2nd dimension. for i in 1 $number; do if [ x"$clip_operator" = x ]; then # No clipping. out=$bdir/collapsed-$i.fits astarithmetic $bdir/in-$i.fits 2 collapse-median counter \ --writeall --output=$out else # With clipping. out=$bdir/collapsed-$clip_operator-$i.fits astarithmetic $bdir/in-$i.fits $clip_multiple $clip_tolerance \ 2 collapse-$clip_operator-median counter \ --writeall --output=$out fi done
After the script finishes, you can see the generated input images with the first command below. The second command shows the stacked images with the different statistics:
$ astscript-fits-view build/in-*.fits --ds9extra="-lock scalelimits yes" $ astscript-fits-view build/stack-*.fits
Color-blind readers may not clearly see the issue in the opened images with this color bar. In this case, please choose the “color” menu at the top of the DS9 and select “gray” or any other color that makes the noisy circle (in the noise) most visible.
The effect of an outlier on the different measurements above can be visually seen (and quantitatively measured) through the visibility of the circle (that was only present in one image, of nine). Let’s look at them one by one (from the one that is most affected to the least):
The standard deviation (third image in DS9) is the most strongly affected statistic by an outlier. This is so strong that the edge of the circle is also clearly visible! The standard deviation is calculated by first finding the mean, and estimating the difference of each element from the mean. Those differences are then taken to the power of two and finally the square root is taken (after a division by the number). It is the power-of-two component that amplifies the effect of the single outlier as you see here.
The mean (first image in DS9) is also affected by the outlier in such a way that the circle footprint is clearly visible. This is because the nine images have the same importance in the combination with a simple mean. Therefore, the outlier value pushes the result to higher values and the circle is printed.
The median (second image in DS9) is also affected by the outlier; although much less significantly than the standard deviation or mean. At first sight the circle may not be too visible! To see it more clearly, click on the “Analysis” menu in DS9 and then the “smooth” item. After smoothing, you will see how the single outlier has leaked into the median stack.
Intuitively, we would think that since the median is calculated from the middle element after sorting, the outlier goes to the end and won’t affect the result. However, this is not the case as we see here: with 9 elements, the “central” element is the 5th (counting from 1; after sorting). Since the pixels covered by the circle only have 8 pure noise elements; the “true” median should have been the average of the 4th and 5th elements (after sorting). By definition, the 5th element is always larger than the mean of the 4th and 5th (because the 4th element after sorting has a smaller value than the 5th element). Therefore, using the 5th element (after sorting), we are systematically choosing higher noise values in regions that are covered by the circle!
With larger datasets, the difference between the central elements will be less. However, the improved precision (in the elements without an outlier) will also be more. A detailed analysis of the effect of a single outlier on the median based on the number of inputs can be done as an exercise; but in general, as this argument shows, the median is not immune to outliers; especially when you care about low signal-to-noise regimes (as we do in astronomy: low surface brightness science).
The median absolute deviation (fourth image in DS9) is affected by outliers in a similar fashion to the median.
The example above included a single outlier.
But we are not usually that lucky: there are usually more outliers!
For example, the last for
loop in the script above collapsed 1.fits (that was pure noise, without the circle) and 9.fits (with the circle) along their second dimension (the vertical).
The output of collapsing has one less dimension; in this case, producing a 1D table (with the same number of rows as the image’s horizontal axis).
Collapsing was done by taking the median along all the pixels in the vertical dimension.
To easily plot the output afterwards, we have also used the counter
operator.
With the command below, you can open both tables and compare them:
$ astscript-fits-view build/collapsed-*.fits
After TOPCAT has opened, select collapsed-1.fits in the “Table List” side-bar. In the “Graphics” menu, select “Plane plot” and you will see all the values fluctuating around 10 (with a maximum/minimum around \(\pm2\)). Afterwards, click on the “Layers” menu of the new window (with a plot) and click on “Add position control”. At the bottom of the window (where the scroll bar in front of “Table” is empty), select collapsed-9.fits. In the regions that there was no circle in any of the vertical axes, the two match nicely (the noise level is the same). However, you see that the regions that were partly covered by the outlying circle gradually get more affected as the width of the circle in that column increases (the full diameter of the circle was in the middle of the image). This shows how the median is biased by outliers as their number increases.
To see the problem more prominently, use the collapse-mean
operator instead of the median.
The reason that the mean is more strongly affected by the outlier is exactly the same as above for the stacking of the input images.
In the subsections below, we will describe some of the available ways (in Gnuastro) to reject the effect of these outliers (and have better stacks or collapses).
But the methodology is not limited to these types of data and can be generically applied; unless specified explicitly.
JavaScript license information
GNU Astronomy Utilities 0.23 manual, July 2024.