Put simply, noise can be characterized with a certain spread about the measured value. In the Gaussian distribution (most commonly used to model noise) the spread is defined by the standard deviation about the characteristic mean.
Let’s start by clarifying some definitions first: Data is defined as the combination of signal and noise (so a noisy image is one dataset). Signal is defined as the mean of the noise on each element. We’ll also assume that the background (see Sky value definition) is subtracted and is zero.
When a data set doesn’t have any signal (only noise), the mean, median and mode of the distribution are equal within statistical errors and approximately equal to the background value. Signal always has a positive value and will never become negative, see Figure 1 in Akhlaghi and Ichikawa (2015). Therefore, as more signal is added, the mean, median and mode of the dataset shift to the positive. The mean’s shift is the largest. The median shifts less, since it is defined based on an ordered distribution and so is not affected by a small number of outliers. The distribution’s mode shifts the least to the positive.
Inverting the argument above gives us a robust method to quantify the significance of signal in a dataset. Namely, when the mean and median of a distribution are approximately equal, or the mean’s quantile is around 0.5, we can argue that there is no significant signal.
To allow for gradients (which are commonly present in ground-based images), we can consider the image to be made of a grid of tiles (see Tessellation135). Hence, from the difference of the mean and median on each tile, we can estimate the significance of signal in it. The median of a distribution is defined to be the value of the distribution’s middle point after sorting (or 0.5 quantile). Thus, to estimate the presence of signal, we’ll compare with the quantile of the mean with 0.5. If the absolute difference in a tile is larger than the value given to the --meanmedqdiff option, that tile will be ignored. You can read this option as “mean-median-quantile-difference”.
The raw dataset’s distribution is noisy, so using the argument above on the raw input will give a noisy result. To decrease the noise/error in estimating the mode, we will use convolution (see Convolution process). Convolution decreases the range of the dataset and enhances its skewness, See Section 3.1.1 and Figure 4 in Akhlaghi and Ichikawa (2015). This enhanced skewness can be interpreted as an increase in the Signal to noise ratio of the objects buried in the noise. Therefore, to obtain an even better measure of the presence of signal in a tile, the mean and median discussed above are measured on the convolved image.
Through the difference of the mean and median we have actually ‘detected’ data in the distribution. However this “detection” was only based on the total distribution of the data in each tile (a much lower resolution). This is the main limitation of this technique. The best approach is thus to do detection over the dataset, mask all the detected pixels and use the undetected regions to estimate the sky and its standard deviation (possibly over a tessellation). This is how NoiseChisel works: it uses the argument above to find tiles that are used to find its thresholds. Several higher-level steps are done on the thresholded pixels to define the higher-level detections (see NoiseChisel).
There is one final hurdle: raw astronomical datasets are commonly peppered with Cosmic rays. Images of Cosmic rays aren’t smoothed by the atmosphere or telescope aperture, so they have sharp boundaries. Also, since they don’t occupy too many pixels, they don’t affect the mode and median calculation. But their very high values can greatly bias the calculation of the mean (recall how the mean shifts the fastest in the presence of outliers), for example see Figure 15 in Akhlaghi and Ichikawa (2015).
The effect of outliers like cosmic rays on the mean and standard deviation can be removed through \(\sigma\)-clipping, see Sigma clipping for a complete explanation. Therefore, after asserting that the mode and median are approximately equal in a tile (see Tessellation), the final Sky value and its standard deviation are determined after \(\sigma\)-clipping with the --sigmaclip option.
In the end, some of the tiles will pass the mean and median quantile difference test. However, prior to interpolating over the failed tiles, another point should be considered: large and extended galaxies, or bright stars, have wings which sink into the noise very gradually. In some cases, the gradient over these wings can be on scales that is larger than the tiles. The mean-median distance test will pass on such tiles and will cause a strong peak in the interpolated tile grid, see Detecting large extended targets.
The tiles that exist over the wings of large galaxies or bright stars are outliers in the distribution of tiles that passed the mean-median quantile distance test. Therefore, the final step of “quantifying signal in a tile” is to look at this distribution and remove the outliers. \(\sigma\)-clipping is a good solution for removing a few outliers, but the problem with outliers of this kind is that there may be many such tiles (depending on the large/bright stars/galaxies in the image). Therefore a novel outlier rejection algorithm will be used.
To identify the first outlier, we’ll use the distribution of distances between sorted elements. If there are \(N\) successful tiles, for every tile, the distance between the adjacent \(N/2\) previous elements is found, giving a distribution containing \(N/2-1\) points. The \(\sigma\)-clipped median and standard deviation of this distribution is then found (\(\sigma\)-clipping is configured with --outliersclip). Finally, if the distance between the element and its previous element is more than --outliersigma multiples of the \(\sigma\)-clipped standard deviation added with the \(\sigma\)-clipped median, that element is considered an outlier and all tiles larger than that value are ignored.
Formally, if we assume there are \(N\) elements. They are first sorted. Searching for the outlier starts on element \(N/2\) (integer division). Let’s take \(v_i\) to be the \(i\)-th element of the sorted input (with no blank values) and \(m\) and \(\sigma\) as the \(\sigma\)-clipped median and standard deviation from the distances of the previous \(N/2-1\) elements (not including \(v_i\)). If the value given to --outliersigma is displayed with \(s\), the \(i\)-th element is considered as an outlier when the condition below is true.
Since \(i\) begins from the median, the outlier has to be larger than the median. You can use the check images (for example --checksky in the Statistics program or --checkqthresh, --checkdetsky and --checksky options in NoiseChisel for any of its steps that uses this outlier rejection) to inspect the steps and see which tiles have been discarded as outliers prior to interpolation.