Random number generators (GNU Astronomy Utilities)

Next: Coordinate and border operators, Previous: Numerical type conversion operators, Up: Arithmetic operators [Contents][Index]

6.2.4.16 Random number generators ¶

When you simulate data (for example, see Sufi simulates a detection), everything is ideal and there is no noise! The final step of the process is to add simulated noise to the data. The operators in this section are designed for that purpose. To learn more about the definition and implementation “noise”, see Noise basics.

In case each data element’s random distribution should have an independent parameter (for example \(\sigma\) in a Gaussian distribution), the first popped operand can be a dataset of the same size as the second. In this case (when the parameter is not a single value, but an array), each element will have a different parameter.

When --quiet is not given, a statement will be printed on each invocation of these operators (if there are multiple calls to the mknoise-* operators, the statement will be printed multiple times). It will show the random number generator function and seed that was used in that invocation. These are necessary for the future reproducibility of the outputs using the --envseed option, for more, see Generating random numbers. For example, with the first command below, image.fits will be degraded by a noise of standard deviation 3 units.

$ astarithmetic image.fits 3 mknoise-sigma

Alternatively, you can use the operators in this section within the Column arithmetic feature of the Table program. For example, with the command below, you can generate a random number (centered on 0, with \(\sigma=3\)). With the second command, you can put it into a shell variable for later usage.

$ echo 0 | asttable -c'arith $1 3 mknoise-sigma'
$ value=$(echo 0 | asttable -c'arith $1 3 mknoise-sigma' --quiet)
$ echo $value

You can also use the operators here in combination with AWK to easily generate an arbitrarily large table with random columns. In the example below, we will create a two column table with 20 rows. The first column will be centered on 5 and \(\sigma_1=2\), the second will be centered on 10 and \(\sigma_2=3\):

$ echo 5 10 \
       | awk '{for(i=0;i<20;++i) print $1, $2}' \
       | asttable -c'arith $1 2 mknoise-sigma' \
                  -c'arith $2 3 mknoise-sigma'

By adding an extra --output=random.fits, the table will be saved into a file called random.fits, and you can change the i<20 to i<5000 to have 5000 rows instead. Of course, if your input table has different values in the desired column the noisy distribution will be centered on each input element, but all will have the same scatter/sigma.

As mentioned above, you can use the --envseed option to pre-define the random number generator seed (and thus get a reproducible result). For more on --envseed, see Generating random numbers. When using column arithmetic in Table, it may happen that multiple columns need random numbers (with any of the mknoise-* operators) in one call of asttable. In such cases, the value given to GSL_RNG_SEED is incremented by one on every call to the mknoise-* operators. Without this increment, when the column values are the same (happens a lot, for no-noised datasets), the returned values for all columns will be identical. But this feature has a side-effect: that if the order of calling the mknoise-* operators changes, the seeds used for each operator will change¹⁶¹.

mknoise-sigma

Add a Gaussian noise with pre-defined \(\sigma\) to each element of the input dataset (independent of the input pixel value). \(\sigma\) is the standard deviation of the Gaussian or Normal distribution. This operator takes two arguments: the top/first popped operand is the noise standard deviation, the next popped operand is the dataset that the noise should be added to.

For example, with the first command below, let’s put a Sérsic profile with Sérsic index 1 and effective radius 10 pixels, truncated at 5 times the effective radius in the center of a mock image that is \(100\times100\) pixels wide. We will also give it a position angle of 45 degrees and an axis ratio of 0.8, and set it to have a total electron count of 10000 (1e4 in the command). Note that this example is focused on this operator, for a robust simulation, see the tutorial in Sufi simulates a detection. With the second command, let’s add noise to this image and with the third command, we’ll subtract the raw image from the noised image. Finally, we’ll view them both together:

$ echo "1 50 50 1 10 1 45 0.8 1e4 5" \
       | astmkprof --mergedsize=100,100 --oversample=1 \
                   --mcolissum --output=raw.fits

$ astarithmetic raw.fits 2 mknoise-sigma --output=sigma.fits

$ astarithmetic raw.fits sigma.fits - -g1 \
                --output=diff-sigma.fits

$ astscript-fits-view raw.fits sigma.fits diff-sigma.fits

You see that the final diff-sigma.fits distribution was independent of the pixel values of the input. You will also notice that within sigma.fits the noisy pixels that had a zero value in raw.fits, the noise fluctuates around zero (is negative in half of those pixels). These behaviors will be different in the case for mknoise-sigma-from-mean below, which is more “realistic” (or Poisson-like).

mknoise-sigma-from-mean ¶

Replace each input element (e.g., pixel in an image) of the input with a random value taken from a Gaussian distribution (for pixel \(i\)) with mean \(\mu_i\) and standard deviation \(\sigma_i\). Where, \(\sigma_i=\sqrt{I_i+B_i}\) and \(\mu_i=I_i+B_i\) and \(I_i\) and \(B_i\) are respectively the values of the input image, and background in that same pixel. In other words, this can be seen as approximating a Poisson distribution at high mean values (where the Poisson distribution becomes identical to the Gaussian distribution).

This operator takes two arguments: 1. the first popped operand (just before the operator) is the per-pixel background value (in units of electron counts). 2. The second popped operand is the dataset that the noise should be added to.

To demonstrate the effect of this noise pattern, please run the example commands in the description of mknoise-sigma. With the first command below, let’s add this Poisson-like noise (assuming a background level of 4 electron counts, to be similar to a \(\sigma=2\) of the example in mknoise-sigma). With the second command, let’s subtract the raw image from this noise pattern:

$ astarithmetic raw.fits 4 mknoise-sigma-from-mean \
                --output=sigma-from-mean.fits

$ astarithmetic raw.fits sigma-from-mean.fits - -g1 \
                --output=diff-sigma-from-mean.fits

$ astscript-fits-view diff-sigma.fits \
                      diff-sigma-from-mean.fits

You clearly see how the noise in the center of the Sérsic profile is much stronger than the outer parts. As described, above, this is behavior we would expect in a “real” observation: the regions with stronger signal, also have stronger noise as defined through the Poisson distribution! The reason we described this operator as “Poisson-like” is that, it has some shortcomings as opposed to the mknoise-poisson operator (that is described below):

For low mean values (less than 3 for example), this will produce a symmetric Gaussian distribution, while the Poisson distribution will not be symmetric.
The random values from this distribution are floating point (unlike the Poisson distribution that produces integers.
The random values can be negative (which is not possible in a Poisson distribution).

Therefore to simulate photon-starved images (for example UV or X-ray data), the mknoise-poisson operator should always be used, not this one. However, in optical (or redder bands) data, the background is very bright (much brighter than 10 counts for example). In such cases (as the mean increases), the Poisson distributions becomes identical to the Gaussian distribution. Furthermore, processed co-add/stacked images are no longer integers, but floating points with the Sky-level already subtracted (see Sky value). Therefore if you are trying to simulate a processed, photon-rich dataset, you can safely use this operator.

Recall that the background values reported by observatories (for example, to define dark or gray nights), or in papers, is usually reported in units of magnitudes per arcseconds square. You need to do the conversion to counts per pixel manually. The conversion of magnitudes to counts is described below. For converting arcseconds squared to number of pixels, you can use the --pixelscale option of Fits. For example, astfits image.fits --pixelscale.

Except for the noise-model, this operator is very similar to mknoise-sigma and the examples there apply here too. The main difference with mknoise-sigma is that in a Poisson distribution the scatter/sigma will depend on each element’s value.

For example, let’s assume you have made a mock image called mock.fits with MakeProfiles and it is assumed zero point is 22.5 (for more on the zero point, see Brightness, Flux, Magnitude and Surface brightness). Let’s assume the background level for the Poisson noise has a value of 19 magnitudes. You can first use the mag-to-counts operator to convert this background magnitude into counts, then feed the background value in counts to mknoise-sigma-from-mean operator:

$ astarithmetic mock.fits 19 22.5 mag-to-counts \
                mknoise-sigma-from-mean

Try changing the background value from 19 to 10 to see the effect! Recall that the tutorial Sufi simulates a detection shows how you can use MakeProfiles to build mock images.

mknoise-poisson

Replace every pixel of the input with a random integer taken from a Poisson distribution with the mean value of that input pixel. Similar to mknoise-sigma-from-mean, it takes two operands: 1. The first popped operand (just before the operator) is the per-pixel background value (in units of electron counts). 2. The second popped operand is the dataset that the noise should be added to.

To demonstrate this noise pattern, let’s use mknoise-poisson in the example of the description of mknoise-sigma-from-mean with the first command below. The second command below will show you the two images side-by-side, you will notice that the Poisson distribution’s un-detected regions are slightly darker (this is because of the skewness of the Poisson distribution). Finally, with the last two commands, you can see the histograms of the two distributions:

$ astarithmetic raw.fits 4 mknoise-poisson \
                --output=poisson.fits

$ astscript-fits-view sigma-from-mean.fits poisson.fits

$ aststatistics sigma-from-mean.fits --lessthan=10
-------
Histogram:
 |                       ***
 |                      ******
 |                    **********
 |                    ***********
 |                  **************
 |                 ****************
 |                ******************
 |              **********************
 |             **************************
 |          **********************************  *
 |* **********************************************************
 |------------------------------------------------------------

$ aststatistics poisson.fits --lessthan=10
-------
Histogram:
 |                    *     *
 |                    *     *
 |                    *     *      *
 |             *      *     *      *
 |             *      *     *      *      *
 |             *      *     *      *      *
 |      *      *      *     *      *      *     *
 |      *      *      *     *      *      *     *
 |      *      *      *     *      *      *     *      *
 |      *      *      *     *      *      *     *      *
 |      *      *      *     *      *      *     *      *
 |------------------------------------------------------------

The extra skewness in the Poisson distribution, and the fact that it only returns integers is therefore clear with the commands above. The comparison was further made above in the description of mknoise-sigma-from-mean. In summary, you should prefer the Poisson distribution when you are simulating the following scenarios:

A photon-starved image (as in UV or X-ray).
A raw exposure of a photon-rich image (which may be photon-rich, but always integers).

mknoise-uniform

Add uniform noise to each element of the input dataset. This operator takes two arguments: the top/first popped operand is the width of the interval, the second popped operand is the dataset that the noise should be added to (each element will be the center of the interval). The returned random values may happen to be the minimum interval value, but will never be the maximum. Except for the noise-model, this operator behaves very similar to mknoise-sigma, see the explanation there for more.

For example, with the command below, a random value will be selected between 10 to 14 (centered on 12, which is the only input data element, with a total width of 4).

echo 12 | asttable -c'arith $1 4 mknoise-uniform'

Similar to the example in mknoise-sigma, you can pipe the output of echo to awk before passing it to asttable to generate a full column of uniformly selected values within the same interval.

random-from-hist-raw

Generate random values from a custom distribution (defined by a histogram). The output will have a double-precision floating point type (see Numeric data types). This operator takes three operands:

The first popped operand (nearest to the operator) is the histogram values. The histogram is a 1-dimensional dataset (a table column) and contains the probability of obtaining a certain interval of values. The histogram does not have to be normalized: the GNU Scientific Library (or GSL, which is used by Gnuastro for this operator), will normalize it internally. The value of each bin (whose probability is given in the histogram) is given in the second popped operand. Therefore these two operands have to have the same number of rows.
The second popped operand is the bin value (mostly the bin center, but it can be anything). The probability of each bin is defined in the histogram operand (first popped operand). The bins can have any width (do not have to be evenly spaced), and any order. Just make sure that the same row in the bins column corresponds to the same row in the histogram: the number of rows in the bins and histogram must be equal.
The third popped operand is the dataset that the random values should be written over. Effectively only its size will be used by this operator (all values will be over-written as a double-precision floating point number).

The first two operands have to be single-dimensional (a table column) and have the same number of rows, but the last popped operand can have any number of dimensions. You can use the load-col- operator to load the two bins and histogram columns from an external file (see Loading external columns).

For example, in the command below, we first construct a fake histogram to represent a \(y=x^2\) distribution with AWK. We aim to distribute random values from this distribution in a \(100\times100\) image. Therefore, we use the makenew operator to construct an empty image of that size, use the load-col- operator to load the histogram columns into Arithmetic and put the output in random.fits. Finally we visually inspect random.fits with DS9 and also have a look at its pixel distribution with aststatistics.

$ echo "" | awk '{for(i=1;i<5;++i) print i, i*i}' \
                > histogram.txt

$ cat histogram.txt
1 1
2 4
3 9
4 16

$ astarithmetic 100 100 2 makenew \
                load-col-1-from-histogram.txt \
                load-col-2-from-histogram.txt \
                random-from-hist-raw \
                --output=random.fits

$ astscript-fits-view random.fits

$ aststatistics random.fits --asciihist --numasciibins=50
 |                                                 *
 |                                                 *
 |                                                 *
 |                                                 *
 |                                 *               *
 |                                 *               *
 |                                 *               *
 |                *                *               *
 |                *                *               *
 |*               *                *               *
 |*               *                *               *
 |--------------------------------------------------

As you see, the 10000 pixels in the image only have values 1, 2, 3 or 4 (which were the values in the bins column of histogram.txt), and the number of times each of these values occurs follows the \(y=x^2\) distribution.

Generally, any value given in the bins column will be used for the final output values. For example, in the command below (for generating a histogram from an analytical function), we are adding the bins by 20 (while keeping the same probability distribution of \(y=x^2\)). If you re-run the Arithmetic command above after this, you will notice that the pixels values are now one of the following 21, 22, 23 or 24 (instead of 1, 2, 3, or 4). But the shape of the histogram of the resulting random distribution will be unchanged.

$ echo "" | awk '{for(i=1;i<5;++i) print 20+i, i*i}' \
                > histogram.txt

If you do not want the outputs to have exactly the value of the bin identifier, but be a randomly selected value from a uniform distribution within the bin, you should use random-from-hist (see below).

As mentioned above, the output will have a double-precision floating point type (see Numeric data types). Therefore, by default each element of the output will consume 8 bytes (64-bits) of storage. This is usually far more than the statistical error/precision of your data (and just results in wasted storage in your file system, or wasted RAM when a program that uses the data is being run, and a slower running time of the program).

It is therefore recommended to use a type-conversion operator after this operator to put the output in the smallest type that can be used to safely store your data without wasting storage, RAM or time. For the list of type conversion operators, see Numerical type conversion operators. Recall that you already know the values returned by this operator (they are one of the values in the bins column).

For example, in the example above, the whole image only has values 1, 2, 3 or 4. Since they are always positive and are below 255, we can safely place them in an unsigned 8-bit integer (see Numeric data types) with the command below (note the uint8 after the operator name, and that we are using a different name for the output). After building the new image, let’s have a look at the sizes of the two images with ls -l:

$ astarithmetic 100 100 2 makenew \
                load-col-1-from-histogram.txt \
                load-col-2-from-histogram.txt \
                random-from-hist-raw uint8 \
                --output=random-u8.fits

$ ls -lh random.fits random-u8.fits
-rw-r--r-- 1 name name 85K Jan 01 13:40 random.fits
-rw-r--r-- 1 name name 17K Jan 01 13:45 random-u8.fits

As you see, when using a suitable data type, we can shrink the size of the file significantly without loosing any information (from 85 kilobytes to 17 kilobytes). This difference can be felt much better for larger (real-world) datasets, so be sure to always set the output data type after calling this operator.

random-from-hist

Similar to random-from-hist-raw, but do not return the exact bin value, instead return a random value from a uniform distribution within each bin. Therefore the following limitations have to be taken into account (compared to random-from-hist-raw):

The number associated with each bin (in the bin column) should be its center.
The bins have to be in descending order (so the second row in the bin column is larger than the first).
The bin widths (distance from one bin to another) have to be fixed.

For a demonstration, let’s replace random-from-hist-raw with random-from-hist in the example of the description of random-from-hist-raw. Note how we are manually converting the output of this operator into single-precision floating point (32-bit, since the default 64-bit precision is statistically meaningless in this scenario and we do not want to waste storage, memory and running time):

$ echo "" | awk '{for(i=1;i<5;++i) print i, i*i}' \
                > histogram.txt

$ astarithmetic 100 100 2 makenew \
                load-col-1-from-histogram.txt \
                load-col-2-from-histogram.txt \
                random-from-hist float32 \
                --output=random.fits

$ aststatistics random.fits --asciihist --numasciibins=50
 |                                          *
 |                                      *** ********
 |                                      ************
 |                                     *************
 |                             *     * *************
 |                         * ***********************
 |                         *************************
 |                         *************************
 |             *************************************
 |********* * **************************************
 |**************************************************
 |--------------------------------------------------

You can see that the pixels of histogram.fits are no longer just 1, 2, 3 or 4. Instead, the values within each bin are selected from a uniform distribution covering that bin. This creates the step-like feature in the histogram of the output.

Of course, this extra uniform random number generation can make your program slower so be sure to check if it is worth it. In particular, one way to avoid this (and use random-from-hist-raw with a more contiguous-looking output distribution) is to simply use a higher-resolution histogram (assuming it is possible: you have a sufficient number of data points, or you have an analytical expression that you can sample at smaller bin sizes).

To better demonstrate this operator and its practical usage in everyday research, let’s look at another example: Assume you want to get 100 random star magnitudes that follow the real-world Gaia Data release 3 magnitude distribution within a radius of 2 degrees around the (RA,Dec) coordinate of (1.23,4.56). Let’s further assume that you want to distribute them uniformly over an image of size 1000 by 1000 pixels. So your desired output table should have three columns, the first two are pixel positions of each star, and the third is the magnitude.

First, we need to query the Gaia database and ask for all the magnitudes in this region of the sky. We know that Gaia is not complete for stars fainter than the 20th magnitude, so we will use the --range option and only ask for those stars that are brighter than magnitude 20.

$ astquery gaia --dataset=dr3 --center=1.23,3.45 --radius=2 \
           --column=phot_g_mean_mag --output=gaia.fits \
           --range=phot_g_mean_mag,-inf,20

We now have more than 25000 magnitudes in gaia.fits! To get a more accurate random sampling of our stars, let’s construct a histogram with 500 bins, and generate our three desired randomly selected columns:

$ aststatistics gaia.fits --histogram --numbins=500 \
                --output=gaia-hist.fits

$ asttable gaia-hist.fits -i

$ echo 1000 \
    | awk '{for(i=0;i<100;++i) print $1/2}' \
    | asttable -c'arith $1 500 mknoise-uniform' \
               -c'arith $1 500 mknoise-uniform' \
               -c'arith $1 \
                        load-col-1-from-gaia-hist.fits-hdu-1 \
                        load-col-2-from-gaia-hist.fits-hdu-1 \
                        random-from-hist float32'

These columns can easily be placed in the format for MakeProfiles to be inserted into an image automatically.

Footnotes

(161)

We have defined Task 15971 in Gnuastro’s project management system to address this. If you need this feature please send us an email at bug-gnuastro@gnu.org (to motivate us in its implementation).