Go to the first, previous, next, last section, table of contents.


Least-Squares Fitting

This chapter describes routines for performing least squares fits to experimental data using linear combinations of functions. The data may be weighted or unweighted, i.e. with known or unknown errors. For weighted data the functions compute the best fit parameters and their associated covariance matrix. For unweighted data the covariance matrix is estimated from the scatter of the points, giving a variance-covariance matrix.

The functions are divided into separate versions for simple one- or two-parameter regression and multiple-parameter fits. The functions are declared in the header file `gsl_fit.h'.

Overview

Least-squares fits are found by minimizing \chi^2 (chi-squared), the weighted sum of squared residuals over n experimental datapoints (x_i, y_i) for the model Y(c,x),

\chi^2 = \sum_i w_i (y_i - Y(c, x_i))^2

The p parameters of the model are c = {c_0, c_1, ...}. The weight factors w_i are given by w_i = 1/\sigma_i^2, where \sigma_i is the experimental error on the data-point y_i. The errors are assumed to be Gaussian and uncorrelated. For unweighted data the chi-squared sum is computed without any weight factors.

The fitting routines return the best-fit parameters c and their p \times p covariance matrix. The covariance matrix measures the statistical errors on the best-fit parameters resulting from the errors on the data, \sigma_i, and is defined as C_{ab} = <\delta c_a \delta c_b> where < > denotes an average over the Gaussian error distributions of the underlying datapoints.

The covariance matrix is calculated by error propagation from the data errors \sigma_i. The change in a fitted parameter \delta c_a caused by a small change in the data \delta y_i is given by

\delta c_a = \sum_i (dc_a/dy_i) \delta y_i

allowing the covariance matrix to be written in terms of the errors on the data,

C_{ab} = \sum_{i,j} (dc_a/dy_i) (dc_b/dy_j) <\delta y_i \delta y_j>

For uncorrelated data the fluctuations of the underlying datapoints satisfy <\delta y_i \delta y_j> = \sigma_i^2 \delta_{ij}, giving a corresponding parameter covariance matrix of

C_{ab} = \sum_i (1/w_i) (dc_a/dy_i) (dc_b/dy_i) 

When computing the covariance matrix for unweighted data, i.e. data with unknown errors, the weight factors w_i in this sum are replaced by the single estimate w = 1/\sigma^2, where \sigma^2 is the computed variance of the residuals about the best-fit model, \sigma^2 = \sum (y_i - Y(c,x_i))^2 / (n-p). This is referred to as the variance-covariance matrix.

The standard deviations of the best-fit parameters are given by the square root of the corresponding diagonal elements of the covariance matrix, \sigma_{c_a} = \sqrt{C_{aa}}. The correlation coefficient of the fit parameters c_a and c_b is given by \rho_{ab} = C_{ab} / \sqrt{C_{aa} C_{bb}}.

Linear regression

The functions described in this section can be used to perform least-squares fits to a straight line model, Y(c,x) = c_0 + c_1 x.

Function: int gsl_fit_linear (const double * x, const size_t xstride, const double * y, const size_t ystride, size_t n, double * c0, double * c1, double * cov00, double * cov01, double * cov11, double * sumsq)
This function computes the best-fit linear regression coefficients (c0,c1) of the model Y = c_0 + c_1 X for the dataset (x, y), two vectors of length n with strides xstride and ystride. The errors on y are assumed unknown so the variance-covariance matrix for the parameters (c0, c1) is estimated from the scatter of the points around the best-fit line and returned via the parameters (cov00, cov01, cov11). The sum of squares of the residuals from the best-fit line is returned in sumsq. Note: the correlation coefficient of the data can be computed using gsl_stats_correlation (see section Correlation), it does not depend on the fit.

Function: int gsl_fit_wlinear (const double * x, const size_t xstride, const double * w, const size_t wstride, const double * y, const size_t ystride, size_t n, double * c0, double * c1, double * cov00, double * cov01, double * cov11, double * chisq)
This function computes the best-fit linear regression coefficients (c0,c1) of the model Y = c_0 + c_1 X for the weighted dataset (x, y), two vectors of length n with strides xstride and ystride. The vector w, of length n and stride wstride, specifies the weight of each datapoint. The weight is the reciprocal of the variance for each datapoint in y.

The covariance matrix for the parameters (c0, c1) is computed using the weights and returned via the parameters (cov00, cov01, cov11). The weighted sum of squares of the residuals from the best-fit line, \chi^2, is returned in chisq.

Function: int gsl_fit_linear_est (double x, double c0, double c1, double cov00, double cov01, double cov11, double * y, double * y_err)
This function uses the best-fit linear regression coefficients c0, c1 and their covariance cov00, cov01, cov11 to compute the fitted function y and its standard deviation y_err for the model Y = c_0 + c_1 X at the point x.

Linear fitting without a constant term

The functions described in this section can be used to perform least-squares fits to a straight line model without a constant term, Y = c_1 X.

Function: int gsl_fit_mul (const double * x, const size_t xstride, const double * y, const size_t ystride, size_t n, double * c1, double * cov11, double * sumsq)
This function computes the best-fit linear regression coefficient c1 of the model Y = c_1 X for the datasets (x, y), two vectors of length n with strides xstride and ystride. The errors on y are assumed unknown so the variance of the parameter c1 is estimated from the scatter of the points around the best-fit line and returned via the parameter cov11. The sum of squares of the residuals from the best-fit line is returned in sumsq.

Function: int gsl_fit_wmul (const double * x, const size_t xstride, const double * w, const size_t wstride, const double * y, const size_t ystride, size_t n, double * c1, double * cov11, double * sumsq)
This function computes the best-fit linear regression coefficient c1 of the model Y = c_1 X for the weighted datasets (x, y), two vectors of length n with strides xstride and ystride. The vector w, of length n and stride wstride, specifies the weight of each datapoint. The weight is the reciprocal of the variance for each datapoint in y.

The variance of the parameter c1 is computed using the weights and returned via the parameter cov11. The weighted sum of squares of the residuals from the best-fit line, \chi^2, is returned in chisq.

Function: int gsl_fit_mul_est (double x, double c1, double cov11, double * y, double * y_err)
This function uses the best-fit linear regression coefficient c1 and its covariance cov11 to compute the fitted function y and its standard deviation y_err for the model Y = c_1 X at the point x.

Multi-parameter fitting

The functions described in this section perform least-squares fits to a general linear model, y = X c where y is a vector of n observations, X is an n by p matrix of predictor variables, and the elements of the vector c are the p unknown best-fit parameters which are to be estimated. The chi-squared value is given by \chi^2 = \sum_i w_i (y_i - \sum_j X_{ij} c_j)^2.

This formulation can be used for fits to any number of functions and/or variables by preparing the n-by-p matrix X appropriately. For example, to fit to a p-th order polynomial in x, use the following matrix,

X_{ij} = x_i^j

where the index i runs over the observations and the index j runs from 0 to p-1.

To fit to a set of p sinusoidal functions with fixed frequencies \omega_1, \omega_2, ..., \omega_p, use,

X_{ij} = sin(\omega_j x_i)

To fit to p independent variables x_1, x_2, ..., x_p, use,

X_{ij} = x_j(i)

where x_j(i) is the i-th value of the predictor variable x_j.

The functions described in this section are declared in the header file `gsl_multifit.h'.

The solution of the general linear least-squares system requires an additional working space for intermediate results, such as the singular value decomposition of the matrix X.

Function: gsl_multifit_linear_workspace * gsl_multifit_linear_alloc (size_t n, size_t p)
This function allocates a workspace for fitting a model to n observations using p parameters.

Function: void gsl_multifit_linear_free (gsl_multifit_linear_workspace * work)
This function frees the memory associated with the workspace w.

Function: int gsl_multifit_linear (const gsl_matrix * X, const gsl_vector * y, gsl_vector * c, gsl_matrix * cov, double * chisq, gsl_multifit_linear_workspace * work)
This function computes the best-fit parameters c of the model y = X c for the observations y and the matrix of predictor variables X, using the preallocated workspace provided in work. The p-by-p variance-covariance matrix of the model parameters cov is set to \sigma^2 (X^T X)^{-1}, where \sigma is the standard deviation of the fit residuals. The sum of squares of the residuals from the best-fit, \chi^2, is returned in chisq. If the coefficient of determination is desired, it can be computed from the expression R^2 = 1 - \chi^2 / TSS, where the total sum of squares (TSS) of the observations y may be computed from gsl_stats_tss.

The best-fit is found by singular value decomposition of the matrix X using the modified Golub-Reinsch SVD algorithm, with column scaling to improve the accuracy of the singular values. Any components which have zero singular value (to machine precision) are discarded from the fit.

Function: int gsl_multifit_wlinear (const gsl_matrix * X, const gsl_vector * w, const gsl_vector * y, gsl_vector * c, gsl_matrix * cov, double * chisq, gsl_multifit_linear_workspace * work)
This function computes the best-fit parameters c of the weighted model y = X c for the observations y with weights w and the matrix of predictor variables X, using the preallocated workspace provided in work. The p-by-p covariance matrix of the model parameters cov is computed as (X^T W X)^{-1}. The weighted sum of squares of the residuals from the best-fit, \chi^2, is returned in chisq. If the coefficient of determination is desired, it can be computed from the expression R^2 = 1 - \chi^2 / WTSS, where the weighted total sum of squares (WTSS) of the observations y may be computed from gsl_stats_wtss.

Function: int gsl_multifit_linear_svd (const gsl_matrix * X, const gsl_vector * y, double tol, size_t * rank, gsl_vector * c, gsl_matrix * cov, double * chisq, gsl_multifit_linear_workspace * work)
Function: int gsl_multifit_wlinear_svd (const gsl_matrix * X, const gsl_vector * w, const gsl_vector * y, double tol, size_t * rank, gsl_vector * c, gsl_matrix * cov, double * chisq, gsl_multifit_linear_workspace * work)
In these functions components of the fit are discarded if the ratio of singular values s_i/s_0 falls below the user-specified tolerance tol, and the effective rank is returned in rank.

Function: int gsl_multifit_linear_usvd (const gsl_matrix * X, const gsl_vector * y, double tol, size_t * rank, gsl_vector * c, gsl_matrix * cov, double * chisq, gsl_multifit_linear_workspace * work)
Function: int gsl_multifit_wlinear_usvd (const gsl_matrix * X, const gsl_vector * w, const gsl_vector * y, double tol, size_t * rank, gsl_vector * c, gsl_matrix * cov, double * chisq, gsl_multifit_linear_workspace * work)
These functions compute the fit using an SVD without column scaling.

Function: int gsl_multifit_linear_est (const gsl_vector * x, const gsl_vector * c, const gsl_matrix * cov, double * y, double * y_err)
This function uses the best-fit multilinear regression coefficients c and their covariance matrix cov to compute the fitted function value y and its standard deviation y_err for the model y = x.c at the point x.

Function: int gsl_multifit_linear_residuals (const gsl_matrix * X, const gsl_vector * y, const gsl_vector * c, gsl_vector * r)
This function computes the vector of residuals r = y - X c for the observations y, coefficients c and matrix of predictor variables X.

Robust linear regression

Ordinary least squares (OLS) models are often heavily influenced by the presence of outliers. Outliers are data points which do not follow the general trend of the other observations, although there is strictly no precise definition of an outlier. Robust linear regression refers to regression algorithms which are robust to outliers. The most common type of robust regression is M-estimation. The general M-estimator minimizes the objective function

\sum_i \rho(e_i) = \sum_i \rho (y_i - Y(c, x_i))

where e_i = y_i - Y(c, x_i) is the residual of the ith data point, and \rho(e_i) is a function which should have the following properties:

The special case of ordinary least squares is given by \rho(e_i) = e_i^2. Letting \psi = \rho' be the derivative of \rho, differentiating the objective function with respect to the coefficients c and setting the partial derivatives to zero produces the system of equations

\sum_i \psi(e_i) X_i = 0

where X_i is a vector containing row i of the design matrix X. Next, we define a weight function w(e) = \psi(e)/e, and let w_i = w(e_i):

\sum_i w_i e_i X_i = 0

This system of equations is equivalent to solving a weighted ordinary least squares problem, minimizing \chi^2 = \sum_i w_i e_i^2. The weights however, depend on the residuals e_i, which depend on the coefficients c, which depend on the weights. Therefore, an iterative solution is used, called Iteratively Reweighted Least Squares (IRLS).

  1. Compute initial estimates of the coefficients c^{(0)} using ordinary least squares
  2. For iteration k, form the residuals e_i^{(k) = (y_i - X_i c^{(k-1)})/(t \sigma^{(k)} \sqrt{1 - h_i})}, where t is a tuning constant depending on the choice of \psi, and h_i are the statistical leverages (diagonal elements of the matrix X (X^T X)^{-1 X^T}). Including t and h_i in the residual calculation has been shown to improve the convergence of the method. The residual standard deviation is approximated as \sigma^{(k) = MAD / 0.6745}, where MAD is the Median-Absolute-Deviation of the n-p largest residuals from the previous iteration.
  3. Compute new weights w_i^{(k) = \psi(e_i^{(k)})/e_i^{(k)}}.
  4. Compute new coefficients c^{(k)} by solving the weighted least squares problem with weights w_i^{(k)}.
  5. Steps 2 through 4 are iterated until the coefficients converge or until some maximum iteration limit is reached. Coefficients are tested for convergence using the critera:
    |c_i^(k) - c_i^(k-1)| \le \epsilon \times max(|c_i^(k)|, |c_i^(k-1)|)
    
    for all 0 \le i < p where \epsilon is a small tolerance factor.

The key to this method lies in selecting the function \psi(e_i) to assign smaller weights to large residuals, and larger weights to smaller residuals. As the iteration proceeds, outliers are assigned smaller and smaller weights, eventually having very little or no effect on the fitted model.

Function: gsl_multifit_robust_workspace * gsl_multifit_robust_alloc (const gsl_multifit_robust_type * T, const size_t n, const size_t p)
This function allocates a workspace for fitting a model to n observations using p parameters. The type T specifies the function \psi and can be selected from the following choices.
Robust type: gsl_multifit_robust_default
This specifies the gsl_multifit_robust_bisquare type (see below) and is a good general purpose choice for robust regression.

Robust type: gsl_multifit_robust_bisquare
This is Tukey's biweight (bisquare) function and is a good general purpose choice for robust regression. The weight function is given by

w(e) = (1 - e^2)^2

and the default tuning constant is t = 4.685.

Robust type: gsl_multifit_robust_cauchy
This is Cauchy's function, also known as the Lorentzian function. This function does not guarantee a unique solution, meaning different choices of the coefficient vector c could minimize the objective function. Therefore this option should be used with care. The weight function is given by

w(e) = 1 / (1 + e^2)

and the default tuning constant is t = 2.385.

Robust type: gsl_multifit_robust_fair
This is the fair \rho function, which guarantees a unique solution and has continuous derivatives to three orders. The weight function is given by

w(e) = 1 / (1 + |e|)

and the default tuning constant is t = 1.400.

Robust type: gsl_multifit_robust_huber
This specifies Huber's \rho function, which is a parabola in the vicinity of zero and increases linearly for a given threshold |e| > t. This function is also considered an excellent general purpose robust estimator, however, occasional difficulties can be encountered due to the discontinuous first derivative of the \psi function. The weight function is given by

w(e) = 1/max(1,|e|)

and the default tuning constant is t = 1.345.

Robust type: gsl_multifit_robust_ols
This specifies the ordinary least squares solution, which can be useful for quickly checking the difference between the various robust and OLS solutions. The weight function is given by

w(e) = 1

and the default tuning constant is t = 1.

Robust type: gsl_multifit_robust_welsch
This specifies the Welsch function which can perform well in cases where the residuals have an exponential distribution. The weight function is given by

w(e) = \exp(-e^2)

and the default tuning constant is t = 2.985.

Function: void gsl_multifit_robust_free (gsl_multifit_robust_workspace * w)
This function frees the memory associated with the workspace w.

Function: const char * gsl_multifit_robust_name (const gsl_multifit_robust_workspace * w)
This function returns the name of the robust type T specified to gsl_multifit_robust_alloc.

Function: int gsl_multifit_robust_tune (const double tune, gsl_multifit_robust_workspace * w)
This function sets the tuning constant t used to adjust the residuals at each iteration to tune. Decreasing the tuning constant increases the downweight assigned to large residuals, while increasing the tuning constant decreases the downweight assigned to large residuals.

Function: int gsl_multifit_robust (const gsl_matrix * X, const gsl_vector * y, gsl_vector * c, gsl_matrix * cov, gsl_multifit_robust_workspace * w)
This function computes the best-fit parameters c of the model y = X c for the observations y and the matrix of predictor variables X, attemping to reduce the influence of outliers using the algorithm outlined above. The p-by-p variance-covariance matrix of the model parameters cov is estimated as \sigma^2 (X^T X)^{-1}, where \sigma is an approximation of the residual standard deviation using the theory of robust regression. Special care must be taken when estimating \sigma and other statistics such as R^2, and so these are computed internally and are available by calling the function gsl_multifit_robust_statistics.

Function: int gsl_multifit_robust_est (const gsl_vector * x, const gsl_vector * c, const gsl_matrix * cov, double * y, double * y_err)
This function uses the best-fit robust regression coefficients c and their covariance matrix cov to compute the fitted function value y and its standard deviation y_err for the model y = x.c at the point x.

Function: gsl_multifit_robust_stats gsl_multifit_robust_statistics (const gsl_multifit_robust_workspace * w)
This function returns a structure containing relevant statistics from a robust regression. The function gsl_multifit_robust must be called first to perform the regression and calculate these statistics. The returned gsl_multifit_robust_stats structure contains the following fields.

Troubleshooting

When using models based on polynomials, care should be taken when constructing the design matrix X. If the x values are large, then the matrix X could be ill-conditioned since its columns are powers of x, leading to unstable least-squares solutions. In this case it can often help to center and scale the x values using the mean and standard deviation:

x' = (x - mu)/sigma

and then construct the X matrix using the transformed values x'.

Examples

The following program computes a least squares straight-line fit to a simple dataset, and outputs the best-fit line and its associated one standard-deviation error bars.

#include <stdio.h>
#include <gsl/gsl_fit.h>

int
main (void)
{
  int i, n = 4;
  double x[4] = { 1970, 1980, 1990, 2000 };
  double y[4] = {   12,   11,   14,   13 };
  double w[4] = {  0.1,  0.2,  0.3,  0.4 };

  double c0, c1, cov00, cov01, cov11, chisq;

  gsl_fit_wlinear (x, 1, w, 1, y, 1, n, 
                   &c0, &c1, &cov00, &cov01, &cov11, 
                   &chisq);

  printf ("# best fit: Y = %g + %g X\n", c0, c1);
  printf ("# covariance matrix:\n");
  printf ("# [ %g, %g\n#   %g, %g]\n", 
          cov00, cov01, cov01, cov11);
  printf ("# chisq = %g\n", chisq);

  for (i = 0; i < n; i++)
    printf ("data: %g %g %g\n", 
                   x[i], y[i], 1/sqrt(w[i]));

  printf ("\n");

  for (i = -30; i < 130; i++)
    {
      double xf = x[0] + (i/100.0) * (x[n-1] - x[0]);
      double yf, yf_err;

      gsl_fit_linear_est (xf, 
                          c0, c1, 
                          cov00, cov01, cov11, 
                          &yf, &yf_err);

      printf ("fit: %g %g\n", xf, yf);
      printf ("hi : %g %g\n", xf, yf + yf_err);
      printf ("lo : %g %g\n", xf, yf - yf_err);
    }
  return 0;
}

The following commands extract the data from the output of the program and display it using the GNU plotutils graph utility,

$ ./demo > tmp
$ more tmp
# best fit: Y = -106.6 + 0.06 X
# covariance matrix:
# [ 39602, -19.9
#   -19.9, 0.01]
# chisq = 0.8

$ for n in data fit hi lo ; 
   do 
     grep "^$n" tmp | cut -d: -f2 > $n ; 
   done
$ graph -T X -X x -Y y -y 0 20 -m 0 -S 2 -Ie data 
     -S 0 -I a -m 1 fit -m 2 hi -m 2 lo

The next program performs a quadratic fit y = c_0 + c_1 x + c_2 x^2 to a weighted dataset using the generalised linear fitting function gsl_multifit_wlinear. The model matrix X for a quadratic fit is given by,

X = [ 1   , x_0  , x_0^2 ;
      1   , x_1  , x_1^2 ;
      1   , x_2  , x_2^2 ;
      ... , ...  , ...   ]

where the column of ones corresponds to the constant term c_0. The two remaining columns corresponds to the terms c_1 x and c_2 x^2.

The program reads n lines of data in the format (x, y, err) where err is the error (standard deviation) in the value y.

#include <stdio.h>
#include <gsl/gsl_multifit.h>

int
main (int argc, char **argv)
{
  int i, n;
  double xi, yi, ei, chisq;
  gsl_matrix *X, *cov;
  gsl_vector *y, *w, *c;

  if (argc != 2)
    {
      fprintf (stderr,"usage: fit n < data\n");
      exit (-1);
    }

  n = atoi (argv[1]);

  X = gsl_matrix_alloc (n, 3);
  y = gsl_vector_alloc (n);
  w = gsl_vector_alloc (n);

  c = gsl_vector_alloc (3);
  cov = gsl_matrix_alloc (3, 3);

  for (i = 0; i < n; i++)
    {
      int count = fscanf (stdin, "%lg %lg %lg",
                          &xi, &yi, &ei);

      if (count != 3)
        {
          fprintf (stderr, "error reading file\n");
          exit (-1);
        }

      printf ("%g %g +/- %g\n", xi, yi, ei);
      
      gsl_matrix_set (X, i, 0, 1.0);
      gsl_matrix_set (X, i, 1, xi);
      gsl_matrix_set (X, i, 2, xi*xi);
      
      gsl_vector_set (y, i, yi);
      gsl_vector_set (w, i, 1.0/(ei*ei));
    }

  {
    gsl_multifit_linear_workspace * work 
      = gsl_multifit_linear_alloc (n, 3);
    gsl_multifit_wlinear (X, w, y, c, cov,
                          &chisq, work);
    gsl_multifit_linear_free (work);
  }

#define C(i) (gsl_vector_get(c,(i)))
#define COV(i,j) (gsl_matrix_get(cov,(i),(j)))

  {
    printf ("# best fit: Y = %g + %g X + %g X^2\n", 
            C(0), C(1), C(2));

    printf ("# covariance matrix:\n");
    printf ("[ %+.5e, %+.5e, %+.5e  \n",
               COV(0,0), COV(0,1), COV(0,2));
    printf ("  %+.5e, %+.5e, %+.5e  \n", 
               COV(1,0), COV(1,1), COV(1,2));
    printf ("  %+.5e, %+.5e, %+.5e ]\n", 
               COV(2,0), COV(2,1), COV(2,2));
    printf ("# chisq = %g\n", chisq);
  }

  gsl_matrix_free (X);
  gsl_vector_free (y);
  gsl_vector_free (w);
  gsl_vector_free (c);
  gsl_matrix_free (cov);

  return 0;
}

A suitable set of data for fitting can be generated using the following program. It outputs a set of points with gaussian errors from the curve y = e^x in the region 0 < x < 2.

#include <stdio.h>
#include <math.h>
#include <gsl/gsl_randist.h>

int
main (void)
{
  double x;
  const gsl_rng_type * T;
  gsl_rng * r;
  
  gsl_rng_env_setup ();
  
  T = gsl_rng_default;
  r = gsl_rng_alloc (T);

  for (x = 0.1; x < 2; x+= 0.1)
    {
      double y0 = exp (x);
      double sigma = 0.1 * y0;
      double dy = gsl_ran_gaussian (r, sigma);

      printf ("%g %g %g\n", x, y0 + dy, sigma);
    }

  gsl_rng_free(r);

  return 0;
}

The data can be prepared by running the resulting executable program,

$ GSL_RNG_TYPE=mt19937_1999 ./generate > exp.dat
$ more exp.dat
0.1 0.97935 0.110517
0.2 1.3359 0.12214
0.3 1.52573 0.134986
0.4 1.60318 0.149182
0.5 1.81731 0.164872
0.6 1.92475 0.182212
....

To fit the data use the previous program, with the number of data points given as the first argument. In this case there are 19 data points.

$ ./fit 19 < exp.dat
0.1 0.97935 +/- 0.110517
0.2 1.3359 +/- 0.12214
...
# best fit: Y = 1.02318 + 0.956201 X + 0.876796 X^2
# covariance matrix:
[ +1.25612e-02, -3.64387e-02, +1.94389e-02  
  -3.64387e-02, +1.42339e-01, -8.48761e-02  
  +1.94389e-02, -8.48761e-02, +5.60243e-02 ]
# chisq = 23.0987

The parameters of the quadratic fit match the coefficients of the expansion of e^x, taking into account the errors on the parameters and the O(x^3) difference between the exponential and quadratic functions for the larger values of x. The errors on the parameters are given by the square-root of the corresponding diagonal elements of the covariance matrix. The chi-squared per degree of freedom is 1.4, indicating a reasonable fit to the data.

The next program demonstrates the advantage of robust least squares on a dataset with outliers. The program generates linear (x,y) data pairs on the line y = 1.45 x + 3.88, adds some random noise, and inserts 3 outliers into the dataset. Both the robust and ordinary least squares (OLS) coefficients are computed for comparison.

#include <stdio.h>
#include <gsl/gsl_multifit.h>
#include <gsl/gsl_randist.h>

int
dofit(const gsl_multifit_robust_type *T,
      const gsl_matrix *X, const gsl_vector *y,
      gsl_vector *c, gsl_matrix *cov)
{
  int s;
  gsl_multifit_robust_workspace * work 
    = gsl_multifit_robust_alloc (T, X->size1, X->size2);

  s = gsl_multifit_robust (X, y, c, cov, work);
  gsl_multifit_robust_free (work);

  return s;
}

int
main (int argc, char **argv)
{
  int i;
  size_t n;
  const size_t p = 2; /* linear fit */
  gsl_matrix *X, *cov;
  gsl_vector *x, *y, *c, *c_ols;
  const double a = 1.45; /* slope */
  const double b = 3.88; /* intercept */
  gsl_rng *r;

  if (argc != 2)
    {
      fprintf (stderr,"usage: robfit n\n");
      exit (-1);
    }

  n = atoi (argv[1]);

  X = gsl_matrix_alloc (n, p);
  x = gsl_vector_alloc (n);
  y = gsl_vector_alloc (n);

  c = gsl_vector_alloc (p);
  c_ols = gsl_vector_alloc (p);
  cov = gsl_matrix_alloc (p, p);

  r = gsl_rng_alloc(gsl_rng_default);

  /* generate linear dataset */
  for (i = 0; i < n - 3; i++)
    {
      double dx = 10.0 / (n - 1.0);
      double ei = gsl_rng_uniform(r);
      double xi = -5.0 + i * dx;
      double yi = a * xi + b;

      gsl_vector_set (x, i, xi);
      gsl_vector_set (y, i, yi + ei);
    }

  /* add a few outliers */
  gsl_vector_set(x, n - 3, 4.7);
  gsl_vector_set(y, n - 3, -8.3);

  gsl_vector_set(x, n - 2, 3.5);
  gsl_vector_set(y, n - 2, -6.7);

  gsl_vector_set(x, n - 1, 4.1);
  gsl_vector_set(y, n - 1, -6.0);

  /* construct design matrix X for linear fit */
  for (i = 0; i < n; ++i)
    {
      double xi = gsl_vector_get(x, i);

      gsl_matrix_set (X, i, 0, 1.0);
      gsl_matrix_set (X, i, 1, xi);
    }

  /* perform robust and OLS fit */
  dofit(gsl_multifit_robust_ols, X, y, c_ols, cov);
  dofit(gsl_multifit_robust_bisquare, X, y, c, cov);

  /* output data and model */
  for (i = 0; i < n; ++i)
    {
      double xi = gsl_vector_get(x, i);
      double yi = gsl_vector_get(y, i);
      gsl_vector_view v = gsl_matrix_row(X, i);
      double y_ols, y_rob, y_err;

      gsl_multifit_robust_est(&v.vector, c, cov, &y_rob, &y_err);
      gsl_multifit_robust_est(&v.vector, c_ols, cov, &y_ols, &y_err);

      printf("%g %g %g %g\n", xi, yi, y_rob, y_ols);
    }

#define C(i) (gsl_vector_get(c,(i)))
#define COV(i,j) (gsl_matrix_get(cov,(i),(j)))

  {
    printf ("# best fit: Y = %g + %g X\n", 
            C(0), C(1));

    printf ("# covariance matrix:\n");
    printf ("# [ %+.5e, %+.5e\n",
               COV(0,0), COV(0,1));
    printf ("#   %+.5e, %+.5e\n", 
               COV(1,0), COV(1,1));
  }

  gsl_matrix_free (X);
  gsl_vector_free (x);
  gsl_vector_free (y);
  gsl_vector_free (c);
  gsl_vector_free (c_ols);
  gsl_matrix_free (cov);
  gsl_rng_free(r);

  return 0;
}

The output from the program is shown in the following plot.

References and Further Reading

A summary of formulas and techniques for least squares fitting can be found in the "Statistics" chapter of the Annual Review of Particle Physics prepared by the Particle Data Group,

The Review of Particle Physics is available online at the website given above.

The tests used to prepare these routines are based on the NIST Statistical Reference Datasets. The datasets and their documentation are available from NIST at the following website,

http://www.nist.gov/itl/div898/strd/index.html.

The GSL implementation of robust linear regression closely follows the publications


Go to the first, previous, next, last section, table of contents.