Skewness caused by signal and its measurement (GNU Astronomy Utilities)

Next: Image surface brightness limit, Previous: NoiseChisel optimization, Up: Detecting large extended targets [Contents][Index]

2.2.3 Skewness caused by signal and its measurement ¶

In the previous section (NoiseChisel optimization) we showed how to customize NoiseChisel for a single-exposure SDSS image of the M51 group. During the customization, we also discussed the skewness caused by signal. In the next section (Image surface brightness limit), we will use this to measure the surface brightness limit of the image. However, to better understand NoiseChisel and also, the image surface brightness limit, understanding the skewness caused by signal, and how to measure it properly are very important. Therefore now that we have separated signal from noise, let’s pause for a moment and look into skewness, how signal creates it, and find the best way to measure it.

Let’s start masking all the detected pixels found at the end of the previous section (NoiseChisel optimization) and having a look at the noise distribution with Gnuastro’s Arithmetic and Statistics programs as shown below (while visually inspecting the masked image with DS9 in the middle).

$ astarithmetic r_detected.fits -hINPUT-NO-SKY set-in \
                r_detected.fits -hDETECTIONS set-det \
                in det nan where -odet-masked.fits
$ ds9 det-masked.fits
$ aststatistics det-masked.fits

You will see that Gnuastro’s Statistics program prints an ASCII histogram when no option is given (it is shown below). This is done to give you a fast and easy view of the distribution of values in the dataset (pixels in an image, or rows in a table’s column).

-------
Input: det-masked.fits (hdu: 1)
-------
  Number of elements:                      903920
  Minimum:                                 -0.113543
  Maximum:                                 0.130339
  Median:                                  -0.00216306
  Mean:                                    -0.0001893073877
  Standard deviation:                      0.02569057188
-------
Histogram:
 |                              ** *
 |                            * ** *  *
 |                           ** ** *  *
 |                         * ** ** ** *
 |                        ** ** ** ** * **
 |                        ** ** ** ** * ** *
 |                      * ** ** ** ** * ** **
 |                     ** ** ** ** **** ** ** *
 |                  ** ** ** ** ** **** ** ** ** *
 |               ** ** ** ** ** ** ******* ** ** ** *
 |*********** ** ** ** ******************* ** ** ** ** ***** ** ***** **
 |----------------------------------------------------------------------

This histogram shows a roughly symmetric noise distribution, so let’s have a look at its skewness. The most commonly used definition of skewness is known as the “Pearson’s first skewness coefficient”. It measures the difference between the mean and median, in units of the standard deviation (STD):

$$\rm{Skewness}\equiv\frac{(\rm{mean}-\rm{median})}{\rm{STD}}$$

The logic behind this definition is simple: as more signal is added to the same pixels that originally only have raw noise (skewness is increased), the mean shifts to the positive faster than the median, so the distance between the mean and median should increase. Let’s measure the skewness (as defined above) over the image without any signal. Its very easy with Gnuastro’s Statistics program (and piping the output to AWK):

$ aststatistics det-masked.fits --mean --median --std \
                | awk '{print ($1-$2)/$3}'
0.0768279

We see that the mean and median are only $0.08\sigma$ (rounded) away from each other (which is very close)! All pixels with significant signal are masked, so this is expected, and everything is fine. Now, let’s check the pixel distribution of the sky-subtracted input (where pixels with significant signal remain, and are not masked):

$ ds9 r_detected.fits
$ aststatistics r_detected.fits -hINPUT-NO-SKY
-------
Input: r_detected.fits (hdu: INPUT-NO-SKY)
Unit: nanomaggy
-------
  Number of elements:                      3049472
  Minimum:                                 -0.113543
  Maximum:                                 159.25
  Median:                                  0.0241158
  Mean:                                    0.1057885317
  Standard deviation:                      0.698167489
-------
Histogram:
 |*
 |*
 |*
 |*
 |*
 |*
 |*
 |*
 |*
 |*
 |******************************************* ***  ** ****  * *   *  * *
 |----------------------------------------------------------------------

Comparing the distributions above, you can see that the minimum value of the image has not changed because we have not masked the minimum values. However, as expected, the maximum value of the image has changed (from $0.13$ to $159.25$). This is clearly evident from the ASCII histogram: the distribution is very elongated because the galaxy inside the image is extremely bright.

Now, let’s limit the displayed information with the --lessthan=0.13 option of Statistics as shown below (to only use values less than 0.13; the maximum of the image where all signal is masked).

$ aststatistics r_detected.fits -hINPUT-NO-SKY --lessthan=0.13
-------
Input: r_detected.fits (hdu: INPUT-NO-SKY)
Range: up to (exclusive) 0.13.
Unit: nanomaggy
-------
  Number of elements:                      2531949
  Minimum:                                 -0.113543
  Maximum:                                 0.126233
  Median:                                  0.0137138
  Mean:                                    0.01735551527
  Standard deviation:                      0.03590550597
-------
Histogram:
 |                                   *
 |                                * ** **
 |                             *  * ** ** **
 |                             *  * ** ** ** *
 |                           * ** * ** ** ** *
 |                          ** ** * ** ** ** *  *
 |                        * ** ** * ** ** ** *  *
 |                       ** ** ** * ** ** ** ** * ** *
 |                     * ** ** **** ** ** ** **** ** ** **
 |                  * ** ** ** **** ** ** ** ******* ** ** ** * ** ** **
 |***** ** ********** ** ** ********** ** ********** ** ************* **
 |----------------------------------------------------------------------

The improvement is obvious: the ASCII histogram better shows the pixel values near the noise level. We can now compare with the distribution of det-masked.fits that we found earlier. The ASCII histogram of det-masked.fits was approximately symmetric, while this is asymmetric in this range, especially in outer (to the right, or positive) direction. The heavier right-side tail is a clear visual demonstration of skewness that is caused by the signal in the un-masked image.

Having visually confirmed the skewness, let’s quantify it with Pearson’s first skewness coefficient. Like before, we can simply use Gnuastro’s Statistics and AWK for the measurement and calculation:

$ aststatistics r_detected.fits --mean --median --std \
                | awk '{print ($1-$2)/$3}'
0.116982

The difference between the mean and median is now approximately $0.12\sigma$. This is larger than the skewness of the masked image (which was approximately $0.08\sigma$). At a glance (only looking at the numbers), it seems that there is not much difference between the two distributions. However, visually looking at the non-masked image, or the ASCII histogram, you would expect the quantified skewness to be much larger than that of the masked image, but that has not happened! Why is that?

The reason is that the presence of signal does not only shift the mean and median, it also increases the standard deviation! To see this for yourself, compare the standard deviation of det-masked.fits (which was approximately $0.025$) to r_detected.fits (without --lessthan; which was approximately $0.699$). The latter is almost 28 times larger!

This happens because the standard deviation is defined only in a symmetric (and Gaussian) distribution. In a non-Gaussian distribution, the standard deviation is poorly defined and is not a good measure of “width”. Since Pearson’s first skewness coefficient is defined in units of the standard deviation, this very large increase in the standard deviation has hidden the much increased distance between the mean and median after adding signal.

We therefore need a better unit or scale to quantify the distance between the mean and median. A unit that is less affected by skewness or outliers. One solution that we have found to be very useful is the quantile units or quantile scale. The quantile scale is defined by first sorting the dataset (which has $N$ elements). If we want the quantile of a value $V$ in a distribution, we first find the nearest data element to $V$ in the sorted dataset. Let’s assume the nearest element is the $i$-th element, counting from 0, after sorting. The quantile of V in that distribution is then defined as $i/(N-1)$ (which will have a value between 0 and 1).

The quantile of the median is obvious from its definition: 0.5. This is because the median is defined to be the middle element of the distribution after sorting. We can therefore define skewness as the quantile of the mean ($q_m$). If $q_m\sim0.5$ (the median), then the distribution (of signal blended in noise) is symmetric (possibly Gaussian, but the functional form is irrelevant here). A larger value for $|q_m-0.5|$ quantifies a more skewed the distribution. Furthermore, a $q_m>0.5$ signifies a positive skewness, while $q_m<0.5$ signifies a negative skewness.

Let’s put this definition to a test on the same two images we have already created. Fortunately Gnuastro’s Statistics program has the --quantofmean option to easily calculate $q_m$ for you. So testing is easy:

$ aststatistics det-masked.fits --quantofmean
0.51295636

$ aststatistics r_detected.fits -hINPUT-NO-SKY --quantofmean
0.8105163158

The two quantiles of mean are now very distinctly different ($0.51$ and $0.81$): differing by about $0.3$ (on a scale of 0 to 1)! Recall that when defining skewness with Pearson’s first skewness coefficient, their difference was negligible ($0.04\sigma$)! You can now better appreciate why we discussed quantile so extensively in NoiseChisel optimization. In case you would like to know more about the usage of the quantile of the mean in Gnuastro, please see Quantifying signal in a tile, or watch this video demonstration: https://peertube.stream/w/35b7c398-9fd7-4bcf-8911-1e01c5124585.