Go to the first, previous, next, last section, table of contents.
This chapter describes routines for performing least squares fits to experimental data using linear combinations of functions. The data may be weighted or unweighted, i.e. with known or unknown errors. For weighted data the functions compute the best fit parameters and their associated covariance matrix. For unweighted data the covariance matrix is estimated from the scatter of the points, giving a variance-covariance matrix.
The functions are divided into separate versions for simple one- or two-parameter regression and multiple-parameter fits. The functions are declared in the header file `gsl_fit.h'.
Least-squares fits are found by minimizing \chi^2 (chi-squared), the weighted sum of squared residuals over n experimental datapoints (x_i, y_i) for the model Y(c,x),
\chi^2 = \sum_i w_i (y_i - Y(c, x_i))^2
The p parameters of the model are c = {c_0, c_1, ...}. The weight factors w_i are given by w_i = 1/\sigma_i^2, where \sigma_i is the experimental error on the data-point y_i. The errors are assumed to be Gaussian and uncorrelated. For unweighted data the chi-squared sum is computed without any weight factors.
The fitting routines return the best-fit parameters c and their p \times p covariance matrix. The covariance matrix measures the statistical errors on the best-fit parameters resulting from the errors on the data, \sigma_i, and is defined as C_{ab} = <\delta c_a \delta c_b> where < > denotes an average over the Gaussian error distributions of the underlying datapoints.
The covariance matrix is calculated by error propagation from the data errors \sigma_i. The change in a fitted parameter \delta c_a caused by a small change in the data \delta y_i is given by
\delta c_a = \sum_i (dc_a/dy_i) \delta y_i
allowing the covariance matrix to be written in terms of the errors on the data,
C_{ab} = \sum_{i,j} (dc_a/dy_i) (dc_b/dy_j) <\delta y_i \delta y_j>
For uncorrelated data the fluctuations of the underlying datapoints satisfy <\delta y_i \delta y_j> = \sigma_i^2 \delta_{ij}, giving a corresponding parameter covariance matrix of
C_{ab} = \sum_i (1/w_i) (dc_a/dy_i) (dc_b/dy_i)
When computing the covariance matrix for unweighted data, i.e. data with unknown errors, the weight factors w_i in this sum are replaced by the single estimate w = 1/\sigma^2, where \sigma^2 is the computed variance of the residuals about the best-fit model, \sigma^2 = \sum (y_i - Y(c,x_i))^2 / (n-p). This is referred to as the variance-covariance matrix.
The standard deviations of the best-fit parameters are given by the square root of the corresponding diagonal elements of the covariance matrix, \sigma_{c_a} = \sqrt{C_{aa}}. The correlation coefficient of the fit parameters c_a and c_b is given by \rho_{ab} = C_{ab} / \sqrt{C_{aa} C_{bb}}.
The functions described in this section can be used to perform least-squares fits to a straight line model, Y(c,x) = c_0 + c_1 x.
gsl_stats_correlation
(see section Correlation), it does not depend on the fit.
The covariance matrix for the parameters (c0, c1) is computed using the weights and returned via the parameters (cov00, cov01, cov11). The weighted sum of squares of the residuals from the best-fit line, \chi^2, is returned in chisq.
The functions described in this section can be used to perform least-squares fits to a straight line model without a constant term, Y = c_1 X.
The variance of the parameter c1 is computed using the weights and returned via the parameter cov11. The weighted sum of squares of the residuals from the best-fit line, \chi^2, is returned in chisq.
The functions described in this section perform least-squares fits to a general linear model, y = X c where y is a vector of n observations, X is an n by p matrix of predictor variables, and the elements of the vector c are the p unknown best-fit parameters which are to be estimated. The chi-squared value is given by \chi^2 = \sum_i w_i (y_i - \sum_j X_{ij} c_j)^2.
This formulation can be used for fits to any number of functions and/or variables by preparing the n-by-p matrix X appropriately. For example, to fit to a p-th order polynomial in x, use the following matrix,
X_{ij} = x_i^j
where the index i runs over the observations and the index j runs from 0 to p-1.
To fit to a set of p sinusoidal functions with fixed frequencies \omega_1, \omega_2, ..., \omega_p, use,
X_{ij} = sin(\omega_j x_i)
To fit to p independent variables x_1, x_2, ..., x_p, use,
X_{ij} = x_j(i)
where x_j(i) is the i-th value of the predictor variable x_j.
The functions described in this section are declared in the header file `gsl_multifit.h'.
The solution of the general linear least-squares system requires an additional working space for intermediate results, such as the singular value decomposition of the matrix X.
gsl_stats_tss
.
The best-fit is found by singular value decomposition of the matrix X using the modified Golub-Reinsch SVD algorithm, with column scaling to improve the accuracy of the singular values. Any components which have zero singular value (to machine precision) are discarded from the fit.
gsl_stats_wtss
.
Ordinary least squares (OLS) models are often heavily influenced by the presence of outliers. Outliers are data points which do not follow the general trend of the other observations, although there is strictly no precise definition of an outlier. Robust linear regression refers to regression algorithms which are robust to outliers. The most common type of robust regression is M-estimation. The general M-estimator minimizes the objective function
\sum_i \rho(e_i) = \sum_i \rho (y_i - Y(c, x_i))
where e_i = y_i - Y(c, x_i) is the residual of the ith data point, and \rho(e_i) is a function which should have the following properties:
The special case of ordinary least squares is given by \rho(e_i) = e_i^2. Letting \psi = \rho' be the derivative of \rho, differentiating the objective function with respect to the coefficients c and setting the partial derivatives to zero produces the system of equations
\sum_i \psi(e_i) X_i = 0
where X_i is a vector containing row i of the design matrix X. Next, we define a weight function w(e) = \psi(e)/e, and let w_i = w(e_i):
\sum_i w_i e_i X_i = 0
This system of equations is equivalent to solving a weighted ordinary least squares problem, minimizing \chi^2 = \sum_i w_i e_i^2. The weights however, depend on the residuals e_i, which depend on the coefficients c, which depend on the weights. Therefore, an iterative solution is used, called Iteratively Reweighted Least Squares (IRLS).
|c_i^(k) - c_i^(k-1)| \le \epsilon \times max(|c_i^(k)|, |c_i^(k-1)|)for all 0 \le i < p where \epsilon is a small tolerance factor.
The key to this method lies in selecting the function \psi(e_i) to assign smaller weights to large residuals, and larger weights to smaller residuals. As the iteration proceeds, outliers are assigned smaller and smaller weights, eventually having very little or no effect on the fitted model.
gsl_multifit_robust_bisquare
type (see below) and is a good
general purpose choice for robust regression.
w(e) = (1 - e^2)^2
and the default tuning constant is t = 4.685.
w(e) = 1 / (1 + e^2)
and the default tuning constant is t = 2.385.
w(e) = 1 / (1 + |e|)
and the default tuning constant is t = 1.400.
w(e) = 1/max(1,|e|)
and the default tuning constant is t = 1.345.
w(e) = 1
and the default tuning constant is t = 1.
w(e) = \exp(-e^2)
and the default tuning constant is t = 2.985.
gsl_multifit_robust_alloc
.
gsl_multifit_robust_statistics
.
gsl_multifit_robust
must be called first to perform the regression and calculate these statistics.
The returned gsl_multifit_robust_stats
structure contains the following fields.
sigma_ols
This contains the standard deviation of the residuals as computed from ordinary least squares (OLS).
sigma_mad
This contains an estimate of the standard deviation of the final residuals using the Median-Absolute-Deviation statistic
sigma_rob
This contains an estimate of the standard deviation of the final residuals from the theory of robust regression (see Street et al, 1988).
sigma
This contains an estimate of the standard deviation of the final residuals by attemping to reconcile sigma_rob
and sigma_ols
in a reasonable way.
Rsq
This contains the R^2 coefficient of determination statistic using the estimate sigma
.
adj_Rsq
This contains the adjusted R^2 coefficient of determination statistic using the estimate sigma
.
rmse
This contains the root mean squared error of the final residuals
sse
This contains the residual sum of squares taking into account the robust covariance matrix.
dof
This contains the number of degrees of freedom n - p
numit
Upon successful convergence, this contains the number of iterations performed
weights
This contains the final weight vector of length n
r
This contains the final residual vector of length n, r = y - X c
When using models based on polynomials, care should be taken when constructing the design matrix X. If the x values are large, then the matrix X could be ill-conditioned since its columns are powers of x, leading to unstable least-squares solutions. In this case it can often help to center and scale the x values using the mean and standard deviation:
x' = (x - mu)/sigma
and then construct the X matrix using the transformed values x'.
The following program computes a least squares straight-line fit to a simple dataset, and outputs the best-fit line and its associated one standard-deviation error bars.
#include <stdio.h> #include <gsl/gsl_fit.h> int main (void) { int i, n = 4; double x[4] = { 1970, 1980, 1990, 2000 }; double y[4] = { 12, 11, 14, 13 }; double w[4] = { 0.1, 0.2, 0.3, 0.4 }; double c0, c1, cov00, cov01, cov11, chisq; gsl_fit_wlinear (x, 1, w, 1, y, 1, n, &c0, &c1, &cov00, &cov01, &cov11, &chisq); printf ("# best fit: Y = %g + %g X\n", c0, c1); printf ("# covariance matrix:\n"); printf ("# [ %g, %g\n# %g, %g]\n", cov00, cov01, cov01, cov11); printf ("# chisq = %g\n", chisq); for (i = 0; i < n; i++) printf ("data: %g %g %g\n", x[i], y[i], 1/sqrt(w[i])); printf ("\n"); for (i = -30; i < 130; i++) { double xf = x[0] + (i/100.0) * (x[n-1] - x[0]); double yf, yf_err; gsl_fit_linear_est (xf, c0, c1, cov00, cov01, cov11, &yf, &yf_err); printf ("fit: %g %g\n", xf, yf); printf ("hi : %g %g\n", xf, yf + yf_err); printf ("lo : %g %g\n", xf, yf - yf_err); } return 0; }
The following commands extract the data from the output of the program
and display it using the GNU plotutils graph
utility,
$ ./demo > tmp $ more tmp # best fit: Y = -106.6 + 0.06 X # covariance matrix: # [ 39602, -19.9 # -19.9, 0.01] # chisq = 0.8 $ for n in data fit hi lo ; do grep "^$n" tmp | cut -d: -f2 > $n ; done $ graph -T X -X x -Y y -y 0 20 -m 0 -S 2 -Ie data -S 0 -I a -m 1 fit -m 2 hi -m 2 lo
The next program performs a quadratic fit y = c_0 + c_1 x + c_2
x^2 to a weighted dataset using the generalised linear fitting function
gsl_multifit_wlinear
. The model matrix X for a quadratic
fit is given by,
X = [ 1 , x_0 , x_0^2 ; 1 , x_1 , x_1^2 ; 1 , x_2 , x_2^2 ; ... , ... , ... ]
where the column of ones corresponds to the constant term c_0. The two remaining columns corresponds to the terms c_1 x and c_2 x^2.
The program reads n lines of data in the format (x, y, err) where err is the error (standard deviation) in the value y.
#include <stdio.h> #include <gsl/gsl_multifit.h> int main (int argc, char **argv) { int i, n; double xi, yi, ei, chisq; gsl_matrix *X, *cov; gsl_vector *y, *w, *c; if (argc != 2) { fprintf (stderr,"usage: fit n < data\n"); exit (-1); } n = atoi (argv[1]); X = gsl_matrix_alloc (n, 3); y = gsl_vector_alloc (n); w = gsl_vector_alloc (n); c = gsl_vector_alloc (3); cov = gsl_matrix_alloc (3, 3); for (i = 0; i < n; i++) { int count = fscanf (stdin, "%lg %lg %lg", &xi, &yi, &ei); if (count != 3) { fprintf (stderr, "error reading file\n"); exit (-1); } printf ("%g %g +/- %g\n", xi, yi, ei); gsl_matrix_set (X, i, 0, 1.0); gsl_matrix_set (X, i, 1, xi); gsl_matrix_set (X, i, 2, xi*xi); gsl_vector_set (y, i, yi); gsl_vector_set (w, i, 1.0/(ei*ei)); } { gsl_multifit_linear_workspace * work = gsl_multifit_linear_alloc (n, 3); gsl_multifit_wlinear (X, w, y, c, cov, &chisq, work); gsl_multifit_linear_free (work); } #define C(i) (gsl_vector_get(c,(i))) #define COV(i,j) (gsl_matrix_get(cov,(i),(j))) { printf ("# best fit: Y = %g + %g X + %g X^2\n", C(0), C(1), C(2)); printf ("# covariance matrix:\n"); printf ("[ %+.5e, %+.5e, %+.5e \n", COV(0,0), COV(0,1), COV(0,2)); printf (" %+.5e, %+.5e, %+.5e \n", COV(1,0), COV(1,1), COV(1,2)); printf (" %+.5e, %+.5e, %+.5e ]\n", COV(2,0), COV(2,1), COV(2,2)); printf ("# chisq = %g\n", chisq); } gsl_matrix_free (X); gsl_vector_free (y); gsl_vector_free (w); gsl_vector_free (c); gsl_matrix_free (cov); return 0; }
A suitable set of data for fitting can be generated using the following program. It outputs a set of points with gaussian errors from the curve y = e^x in the region 0 < x < 2.
#include <stdio.h> #include <math.h> #include <gsl/gsl_randist.h> int main (void) { double x; const gsl_rng_type * T; gsl_rng * r; gsl_rng_env_setup (); T = gsl_rng_default; r = gsl_rng_alloc (T); for (x = 0.1; x < 2; x+= 0.1) { double y0 = exp (x); double sigma = 0.1 * y0; double dy = gsl_ran_gaussian (r, sigma); printf ("%g %g %g\n", x, y0 + dy, sigma); } gsl_rng_free(r); return 0; }
The data can be prepared by running the resulting executable program,
$ GSL_RNG_TYPE=mt19937_1999 ./generate > exp.dat $ more exp.dat 0.1 0.97935 0.110517 0.2 1.3359 0.12214 0.3 1.52573 0.134986 0.4 1.60318 0.149182 0.5 1.81731 0.164872 0.6 1.92475 0.182212 ....
To fit the data use the previous program, with the number of data points given as the first argument. In this case there are 19 data points.
$ ./fit 19 < exp.dat 0.1 0.97935 +/- 0.110517 0.2 1.3359 +/- 0.12214 ... # best fit: Y = 1.02318 + 0.956201 X + 0.876796 X^2 # covariance matrix: [ +1.25612e-02, -3.64387e-02, +1.94389e-02 -3.64387e-02, +1.42339e-01, -8.48761e-02 +1.94389e-02, -8.48761e-02, +5.60243e-02 ] # chisq = 23.0987
The parameters of the quadratic fit match the coefficients of the expansion of e^x, taking into account the errors on the parameters and the O(x^3) difference between the exponential and quadratic functions for the larger values of x. The errors on the parameters are given by the square-root of the corresponding diagonal elements of the covariance matrix. The chi-squared per degree of freedom is 1.4, indicating a reasonable fit to the data.
The next program demonstrates the advantage of robust least squares on a dataset with outliers. The program generates linear (x,y) data pairs on the line y = 1.45 x + 3.88, adds some random noise, and inserts 3 outliers into the dataset. Both the robust and ordinary least squares (OLS) coefficients are computed for comparison.
#include <stdio.h> #include <gsl/gsl_multifit.h> #include <gsl/gsl_randist.h> int dofit(const gsl_multifit_robust_type *T, const gsl_matrix *X, const gsl_vector *y, gsl_vector *c, gsl_matrix *cov) { int s; gsl_multifit_robust_workspace * work = gsl_multifit_robust_alloc (T, X->size1, X->size2); s = gsl_multifit_robust (X, y, c, cov, work); gsl_multifit_robust_free (work); return s; } int main (int argc, char **argv) { int i; size_t n; const size_t p = 2; /* linear fit */ gsl_matrix *X, *cov; gsl_vector *x, *y, *c, *c_ols; const double a = 1.45; /* slope */ const double b = 3.88; /* intercept */ gsl_rng *r; if (argc != 2) { fprintf (stderr,"usage: robfit n\n"); exit (-1); } n = atoi (argv[1]); X = gsl_matrix_alloc (n, p); x = gsl_vector_alloc (n); y = gsl_vector_alloc (n); c = gsl_vector_alloc (p); c_ols = gsl_vector_alloc (p); cov = gsl_matrix_alloc (p, p); r = gsl_rng_alloc(gsl_rng_default); /* generate linear dataset */ for (i = 0; i < n - 3; i++) { double dx = 10.0 / (n - 1.0); double ei = gsl_rng_uniform(r); double xi = -5.0 + i * dx; double yi = a * xi + b; gsl_vector_set (x, i, xi); gsl_vector_set (y, i, yi + ei); } /* add a few outliers */ gsl_vector_set(x, n - 3, 4.7); gsl_vector_set(y, n - 3, -8.3); gsl_vector_set(x, n - 2, 3.5); gsl_vector_set(y, n - 2, -6.7); gsl_vector_set(x, n - 1, 4.1); gsl_vector_set(y, n - 1, -6.0); /* construct design matrix X for linear fit */ for (i = 0; i < n; ++i) { double xi = gsl_vector_get(x, i); gsl_matrix_set (X, i, 0, 1.0); gsl_matrix_set (X, i, 1, xi); } /* perform robust and OLS fit */ dofit(gsl_multifit_robust_ols, X, y, c_ols, cov); dofit(gsl_multifit_robust_bisquare, X, y, c, cov); /* output data and model */ for (i = 0; i < n; ++i) { double xi = gsl_vector_get(x, i); double yi = gsl_vector_get(y, i); gsl_vector_view v = gsl_matrix_row(X, i); double y_ols, y_rob, y_err; gsl_multifit_robust_est(&v.vector, c, cov, &y_rob, &y_err); gsl_multifit_robust_est(&v.vector, c_ols, cov, &y_ols, &y_err); printf("%g %g %g %g\n", xi, yi, y_rob, y_ols); } #define C(i) (gsl_vector_get(c,(i))) #define COV(i,j) (gsl_matrix_get(cov,(i),(j))) { printf ("# best fit: Y = %g + %g X\n", C(0), C(1)); printf ("# covariance matrix:\n"); printf ("# [ %+.5e, %+.5e\n", COV(0,0), COV(0,1)); printf ("# %+.5e, %+.5e\n", COV(1,0), COV(1,1)); } gsl_matrix_free (X); gsl_vector_free (x); gsl_vector_free (y); gsl_vector_free (c); gsl_vector_free (c_ols); gsl_matrix_free (cov); gsl_rng_free(r); return 0; }
The output from the program is shown in the following plot.
A summary of formulas and techniques for least squares fitting can be found in the "Statistics" chapter of the Annual Review of Particle Physics prepared by the Particle Data Group,
The Review of Particle Physics is available online at the website given above.
The tests used to prepare these routines are based on the NIST Statistical Reference Datasets. The datasets and their documentation are available from NIST at the following website,
http://www.nist.gov/itl/div898/strd/index.html.
The GSL implementation of robust linear regression closely follows the publications
Go to the first, previous, next, last section, table of contents.