Least squares fitting (GNU Astronomy Utilities)

Next: Sky value, Previous: 2D Histograms, Up: Statistics [Contents][Index]

7.1.3 Least squares fitting ¶

After completing a good observation, doing robust data reduction and finalizing the measurements, it is commonly necessary to parameterize the derived correlations. For example, you have derived the radial profile of the PSF of your image (see Building the extended PSF). You now want to parameterize the radial profile to estimate the slope. Alternatively, you may have found the star formation rate and stellar mass of your sample of galaxies. Now, you want to derive the star formation main sequence as a parametric relation between the two. The fitting functions below can be used for such purposes.

Gnuastro’s least squares fitting features are just wrappers over the least squares fitting methods of the linear and nonlinear least-squares fitting functions of the GNU Scientific Library (GSL). For the low-level details and equations of the methods, please see the GSL documentation. The names have been preserved here in Gnuastro to simplify the connection with GSL and follow the details in the detailed documentation there.

GSL is a very low-level library, designed for maximal portability to many scenarios, and power. Therefore calling GSL’s functions directly for a fast operation requires a good knowledge of the C programming language and many lines of code. As a low-level library, GSL is designed to be the back-end of higher-level programs (like Gnuastro). Through the Statistics program, in Gnuastro we provide a high-level interface to access to GSL’s very powerful least squares fitting engine to read/write from/to standard data formats in astronomy. A fully working example is shown below.

To activate fitting in Statistics, simply give your desired fitting method to the --fit option (for the full list of acceptable methods, see Fitting options). For example, with the command below, we’ll build a fake measurement table (including noise) from the polynomial \(y=1.23-4.56x+7.89x^2\). To understand how this equation translates to the command below (part before set-y), see Reverse polish notation and Column arithmetic. We will set the X axis to have values from 0.1 to 2, with steps of 0.01 and let’s assume a random Gaussian noise to each \(y\) measurement: \(\sigma_y=0.1y\). To make the random number generation exactly reproducible, we are also setting the seed (see Generating random numbers, which also uses GSL as a backend). To learn more about the mknoise-sigma operator, see the Arithmetic program’s Random number generators.

$ export GSL_RNG_SEED=1664015492
$ seq 0.1 0.01 2 \
      | asttable --output=noisy.fits --envseed -c1 \
                 -c'arith 1.23 -4.56 $1 x + 7.89 $1 x $1 x + set-y \
                          0.1 y x                            set-yerr \
                          y yerr mknoise-sigma yerr' \
                 --colmetadata=1,X --colmetadata=2,Y \
                 --colmetadata=3,Yerr

Let’s have a look at the output plot with TOPCAT using the command below.

$ astscript-fits-view noisy.fits

To see the error-bars, after opening the scatter plot, go into the “Form” tab for that plot. Click on the button with a green “+” sign followed by “Forms” and select “XYError”. On the side-menu, in front of “Y Positive Error”, select the Yerr column of the input table.

As you see, the error bars do indeed increase for higher X axis values. Since we have error bars in this example (as in any measurement), we can use weighted fitting. Also, this isn’t a linear relation, so we’ll use a polynomial to second order (a maximum power of 2 in the form of \(Y=c_0+c_1X+c_2X^2\)):

$ aststatistics noisy.fits -cX,Y,Yerr --fit=polynomial-weighted \
                --fitmaxpower=2
Statistics (GNU Astronomy Utilities) 0.23
-------
Fitting results (remove extra info with '--quiet' or '-q)
  Input file:    noisy.fits (hdu: 1) with 191 non-blank rows.
  X      column: X
  Y      column: Y
  Weight column: Yerr    [Standard deviation of Y in each row]

Fit function: Y = c0 + (c1 * X^1) + (c2 * X^2) + ... (cN * X^N)
  N:  2
  c0:  +1.2286211608
  c1:  -4.5127796636
  c2:  +7.8435883943

Covariance matrix:
  +0.0010496001        -0.0039928488        +0.0028367390
  -0.0039928488        +0.0175244127        -0.0138030778
  +0.0028367390        -0.0138030778        +0.0128129806

Reduced chi^2 of fit:
  +0.9740670090

As you see from the elaborate message, the weighted polynomial fitting has found return the \(c_0\), \(c_1\) and \(c_2\) of \(Y=c_0+c_1X+c_2X^2\) that best represents the data we inserted. Our input values were \(c_0=1.23\), \(c_1=-4.56\) and \(c_2=7.89\), and the fitted values are \(c_0\approx1.2286\), \(c_1\approx-4.5128\) and \(c_2\approx7.8436\) (which is statistically a very good fit! given that we knew the original values a-priori!). The covariance matrix is also calculated, it is necessary to calculate error bars on the estimations and contains a lot of information (e.g., possible correlations between parameters). Finally, the reduced \(\chi^2\) (or \(\chi_{red}^2\)) of the fit is also printed (which was the measure to minimize). A \(\chi_{red}^2\approx1\) shows a good fit. This is good for real-world scenarios when you don’t know the original values a-priori. For more on interpreting \(\chi_{red}^2\approx1\), see Andrae et al. 2010.

The comparison of fitted and input values look pretty good, but nothing beats visual inspection! To see how this looks compared to the data, let’s open the table again:

$ astscript-fits-view noisy.fits

Repeat the steps below to show the scatter plot and error-bars. Then, go to the “Layers” menu and select “Add Function Control”. Use the results above to fill the box in front of “Function Expression”: 1.2286+(-4.5128*x)+(7.8436*x*x). You will see that the second order polynomial falls very nicely over the points¹⁹⁵. But this fit is not perfect: it also has errors (inherited from the measurement errors). We need the covariance matrix to estimate the errors on each point, and that can be complex to do by hand.

Fortunately GSL has the tools to easily estimate the function at any point and also calculate its corresponding error. To access this feature within Gnuastro’s Statistics program, you should use the --fitestimate option. You can either give an independent table file name (with --fitestimatehdu and --fitestimatecol to specify the HDU and column in that file), or just self so it uses the same X axis column that was used in this fit. Let’s use the easier case:

$ aststatistics noisy.fits -cX,Y,Yerr --fit=polynomial-weighted \
                --fitmaxpower=2 --fitestimate=self --output=est.fits

...[[truncated; same as above]]...

Requested estimation:
  Written to: est.fits

The first lines of the printed text are the same as before. Afterwards, you will see a new line printed in the output, saying that the estimation was written in est.fits. You can now inspect the two tables with TOPCAT again with the command below. After TOPCAT opens, plot both scatter plots:

$ astscript-fits-view noisy.fits est.fits

It is clear that they fall nicely on top of each other. The est.fits table also has a third column with error bars. You can follow the same steps before and draw the error bars to see how they compare with the scatter of the measured data. They are much smaller than the error in each point because we had a very good sampling of the function in our noisy data.

Another useful point with the estimated output file is that it contains all the fitting outputs as keywords in the header:

$ astfits est.fits -h1
...[[truncated]]...

                      / Fit results
FITTYPE = 'polynomial-weighted' / Functional form of the fitting.
FITMAXP =                    2 / Maximum power of polynomial.
FITIN   = 'noisy.fits'         / Name of file with input columns.
FITINHDU= '1       '           / Name or Number of HDU with input cols.
FITXCOL = 'X       '           / Name or Number of independent (X) col.
FITYCOL = 'Y       '           / Name or Number of measured (Y) column.
FITWCOL = 'Yerr    '           / Name or Number of weight column.
FITWNAT = 'Standard deviation' / Nature of weight column.
FRDCHISQ=    0.974067008958516 / Reduced chi^2 of fit.
FITC0   =     1.22862116084727 / C0: multiple of x^0 in polynomial
FITC1   =    -4.51277966356177 / C1: multiple of x^1 in polynomial
FITC2   =     7.84358839431161 / C2: multiple of x^2 in polynomial
FCOV11  =  0.00104960011629718 / Covariance matrix element (1,1).
FCOV12  = -0.00399284880859776 / Covariance matrix element (1,2).
FCOV13  =  0.00283673901863388 / Covariance matrix element (1,3).
FCOV21  = -0.00399284880859776 / Covariance matrix element (2,1).
FCOV22  =   0.0175244126670659 / Covariance matrix element (2,2).
FCOV23  =  -0.0138030778380786 / Covariance matrix element (2,3).
FCOV31  =  0.00283673901863388 / Covariance matrix element (3,1).
FCOV32  =  -0.0138030778380786 / Covariance matrix element (3,2).
FCOV33  =   0.0128129806394559 / Covariance matrix element (3,3).

...[[truncated]]...

In scenarios were you don’t want the estimation, but only the fitted parameters, all that verbose, human-friendly text or FITS keywords can be an annoying extra step. For such cases, you should use the --quiet option like below. It will print the parameters, rows of the covariance matrix and \(\chi_{red}^2\) on separate lines with nothing extra. This allows you to parse the values in any way that you would like.

$ aststatistics noisy.fits -cX,Y,Yerr --fit=polynomial-weighted \
                --fitmaxpower=2 --quiet
+1.2286211608 -4.5127796636 +7.8435883943
+0.0010496001        -0.0039928488        +0.0028367390
-0.0039928488        +0.0175244127        -0.0138030778
+0.0028367390        -0.0138030778        +0.0128129806
+0.9740670090

As a final example, because real data usually have outliers, let’s look at the “robust” polynomial fit which has special features to remove outliers. First, we need to add some outliers to the table. To do this, we’ll make a plain-text table with echo, and use Table’s --catrowfile to concatenate (or append) those two rows to the original table. Finally, we’ll run the same fitting step above:

$ echo "0.6  20  0.01"  > outliers.txt
$ echo "0.8  20  0.01" >> outliers.txt

$ asttable noisy.fits --catrowfile=outliers.txt \
           --output=with-outlier.fits

$ aststatistics with-outlier.fits -cX,Y,Yerr --fit=polynomial-weighted \
                --fitmaxpower=2 --fitestimate=self \
                --output=est-out.fits

Statistics (GNU Astronomy Utilities) 0.23
-------
Fitting results (remove extra info with '--quiet' or '-q)
  Input file:    with-outlier.fits (hdu: 1) with 193 non-blank rows.
  X      column: X
  Y      column: Y
  Weight column: Yerr    [Standard deviation of Y in each row]

Fit function: Y = c0 + (c1 * X^1) + (c2 * X^2) + ... (cN * X^N)
  N:  2
  c0:  -13.6446036899
  c1:  +66.8463258547
  c2:  -30.8746303591

Covariance matrix:
  +0.0007889160        -0.0027706310        +0.0022208939
  -0.0027706310        +0.0113922468        -0.0100306732
  +0.0022208939        -0.0100306732        +0.0094087226

Reduced chi^2 of fit:
  +4501.8356719150

Requested estimation:
  Written to: est-out.fit

We see that the coefficient values have changed significantly and that \(\chi_{red}^2\) has increased to \(4501\)! Recall that a good fit should have \(\chi_{red}^2\approx1\). These numbers clearly show that the fit was bad, but again, nothing beats a visual inspection. To visually see the effect of those outliers, let’s plot them with the command below. You see that those two points have clearly caused a turn in the fitted result which is terrible.

$ astscript-fits-view with-outlier.fits est-out.fits

For such cases, GSL has Robust linear regression. In Gnuastro’s Statistics, you can access it with --fit=polynomial-robust, like the example below. Just note that the robust method doesn’t take an error column (because it estimates the errors internally while rejecting outliers, based on the method).

$ aststatistics with-outlier.fits -cX,Y --fit=polynomial-robust \
                --fitmaxpower=2 --fitestimate=self \
                --output=est-out.fits --quiet

$ astfits est-out.fits -h1 | grep ^FITC
FITC0   =     1.20422691185238 / C0: multiple of x^0 in polynomial
FITC1   =     -4.4779253576348 / C1: multiple of x^1 in polynomial
FITC2   =     7.84986153686548 / C2: multiple of x^2 in polynomial

$ astscript-fits-view with-outlier.fits est-out.fits

It is clear that the coefficients are very similar to the no-outlier scenario above and if you run the second command to view the scatter plots on TOPCAT, you also see that the fit nicely follows the curve and is not affected by those two points. GSL provides many methods to reject outliers. For their full list, see the description of --fitrobust in Fitting options. For a description of the outlier rejection methods, see the GSL manual.

You may have noticed that unlike the cases before the last Statistics command above didn’t print anything on the standard output. This is becasue --quiet and --fitestimate were called together. In this case, because all the fitting parameters are written as FITS keywords, because of the --quiet option, they are no longer printed on standard output.

Footnotes

(195)

After plotting, you will notice that the legend made the plot too thin. Fortunately you have a lot of empty space within the plot. To bring the legend in, click on the “Legend” item on the bottom-left menu, in the “Location” tab, click on “Internal” and hold and move it to the top-left in the box below. To make the functional fit more clear, you can click on the “Function” item of the bottom-left menu. In the “Style” tab, change the color and thickness.