Building inputs and analysis without clipping (GNU Astronomy Utilities)

Next: Sigma clipping, Previous: Clipping outliers, Up: Clipping outliers [Contents][Index]

2.10.1 Building inputs and analysis without clipping ¶

As described in Clipping outliers, the goal of this tutorial is to demonstrate the effects of outliers and show how to “clip” them from basic statistics measurements. This is best done on an actual dataset (rather than pure theory). In this section we will build nine noisy images with the script below, such that one of the images has a circle in the middle. We will then stack the 9 images into one final image based on different statistical measurements: the mean, median, standard deviation (STD), median absolute deviation (MAD) and number of inputs used in each pixel. We will then analyze the resulting stacks to demonstrate the problem with outliers.

Put the script below into a plain-text file (assuming it is called script.sh), and run it with bash ./script.sh. For more on writing and good practices in shell scripts, see Writing scripts to automate the steps. The last command of the script above calls DS9 to visualize the five output stacked images mentioned above.

# Constants
list=""
sigma=10
number=9
radius=30
width=201
bdir=build
profsum=3e5
background=10
random_seed=1699270427


# Clipping parameters (will be set later when we start clipping).
# clip_multiple:  3   for sigma; 4.48 for MAD
# clip_tolerance: 0.1 for sigma; 0.01 for MAD
clip_operator=""
clip_multiple=""
clip_tolerance=""


# Stop if there is any error.
set -e


# If the build directory does not exist, build it.
if ! [ -d $bdir ]; then mkdir $bdir; fi


# The final image (with largest number) will contain the outlier:
# we'll put a flat circle in the center of the image as the outlier
# structure.
outlier=$bdir/in-$number.fits
nn=$bdir/$number-no-noise.fits
export GSL_RNG_SEED=$random_seed
center=$(echo $width | awk '{print int($1/2)+1}')
echo "1 $center $center 5 $radius 0 0 1 $profsum 1" \
    | astmkprof --mode=img --mergedsize=$width,$width \
                --oversample=1 --output=$nn --mcolissum
astarithmetic $nn $background + $sigma mknoise-sigma \
	      --envseed -o$outlier


# Build pure noise and add elements to the list of images to stack.
list=$outlier
numnoise=$(echo $number | awk '{print $1-1}')
for i in $(seq 1 $numnoise); do
    img="$bdir/in-$i.fits"
    if ! [ -f $img ]; then
	export GSL_RNG_SEED=$(echo $random_seed | awk '{print $1+'$i'}')
	astarithmetic $width $width 2 makenew float32 $background + \
		      $sigma mknoise-sigma --envseed --output=$img
    fi
    list="$list $img"
done


# Stack the images,
for op in mean median std mad number; do
    if [ x"$clip_operator" = x ]; then
	out=$bdir/stack-$op.fits
	astarithmetic $list $number $op -g1 --output=$out
    else
	operator=$clip_operator-$op
	out=$bdir/stack-$operator.fits
	astarithmetic $list $number $clip_multiple $clip_tolerance \
		      $operator -g1 --output=$out
    fi
done


# Collapse the first and last image along the 2nd dimension.
for i in 1 $number; do
    if [ x"$clip_operator" = x ]; then
	out=$bdir/collapsed-$i.fits
	astarithmetic $bdir/in-$i.fits 2 collapse-median counter \
		      --writeall --output=$out
    else
	out=$bdir/collapsed-$clip_operator-$i.fits
	astarithmetic $bdir/in-$i.fits $clip_multiple $clip_tolerance \
		      2 collapse-$clip_operator-median counter \
		      --writeall --output=$out
    fi
done

After the script finishes, you can see the generated input images with the first command below. The second command shows the stacked images.

$ astscript-fits-view build/in-*.fits --ds9extra="-lock scalelimits yes"
$ astscript-fits-view build/stack-*.fits

Color-blind readers may not clearly see the issue in the opened images with this color bar. In this case, please choose the “color” menu at the top of the DS9 and select “gray” or any other color that makes the circle most visible.

The effect of an outlier on the different measurements above can be visually seen (and quantitatively measured) through the visibility of the circle (that was only present in one image, of nine). Let’s look at them one by one (from the one that is most affected to the least):

std.fits

The standard deviation (third image in DS9) is the most strongly affected statistic by an outlier. This is so strong that the edge of the circle is also clearly visible! The standard deviation is calculated by first finding the mean, and estimating the difference of each element from the mean. Those differences are then taken to the power of two and finally the square root is taken (after a division by the number). It is the power-of-two component that amplifies the effect of the single outlier as you see here.

mean.fits

The mean (first image in DS9) is also affected by the outlier in such a way that the circle footprint is clearly visible. This is because the nine images have the same importance in the combination with a simple mean. Therefore, the outlier value pushes the result to higher values and the circle is printed.

median.fits

The median (second image in DS9) is also affected by the outlier; although much less significantly than the standard deviation or mean. At first sight the circle may not be too visible! To see it more clearly, click on the “Analysis” menu in DS9 and then the “smooth” item. After smoothing, you will see how the single outlier has leaked into the median stack.

Intuitively, we would think that since the median is calculated from the middle element after sorting, the outlier goes to the end and won’t affect the result. However, this is not the case as we see here: with 9 elements, the “central” element is the 5th (counting from 1; after sorting). Since the pixels covered by the circle only have 8 pure noise elements; the “true” median should have been the average of the 4th and 5th elements (after sorting). By definition, the 5th element is always larger than the mean of the 4th and 5th (because the 4th element after sorting has a smaller value than the 5th element). Therefore, using the 5th element (after sorting), we are systematically choosing higher noise values in regions that are covered by the circle!

With larger datasets, the difference between the central elements will be less. However, the improved precision (in the elements without an outlier) will also be more. A detailed analysis of the effect of a single outlier on the median based on the number of inputs can be done as an exercise; but in general, as this argument shows, the median is not immune to outliers; especially when you care about low signal-to-noise regimes (as we do in astronomy: low surface brightness science).

mad.fits

The median absolute deviation (fourth image in DS9) is affected by outliers in a similar fashion to the median.

number.fits

The number image (last image in DS9) shows the number of images that went into each pixel. Since we haven’t rejected any outliers (yet!), all the pixels in this image have a value of 9.

The example above included a single outlier. But we are not usually that lucky: there are usually more outliers! For example, the last command in the script above collapsed 1.fits (that was pure noise, without the circle) and 9.fits (with the circle) along their second dimension (the vertical). Collapsing was done by taking the median along all the pixels in the vertical dimension. The output of collapsing has one less dimension; in this case, producing a 1D table (with the same number of rows as the image’s horizontal axis). To easily plot the output afterwards, we have also used the counter operator. With the command below, you can open both tables and compare them:

$ astscript-fits-view build/collapsed-*.fits

The last command opens TOPCAT. In the “Graphics” menu, select plane plot and you will see all the values fluctuating around 10 (with a maximum/minimum around \(\pm2\)). Afterwards, click on the “Layers” menu and click on “Add position control”. In the opened tab at the bottom (where the scroll bar in front of “Table” is empty), select the other table. In the regions that there was no circle in any of the vertical axes, the two match nicely (the noise level is the same). However, you see that the image columns that were partly covered by the outlying circle gradually get more affected as the width of the circle in that column increases (the full diameter of the circle was in the middle of the image). This shows how the median is biased by outliers as their number increases.

To see the problem more prominently, use the collapse-mean operator instead of the median. The reason that the mean is more strongly affected by the outlier is exactly the same as above for the stacking of the input images. In the subsections below, we will describe some of the basic ways to reject the effect of these outliers (and have better stacks or collapses). But the methodology is not limited to these types of data and can be generically applied; unless specified explicitly.