10.8.1 Linear Fits

The a F (calc-curve-fit) [fit] command attempts to fit a set of data (‘x’ and ‘y’ vectors of numbers) to a straight line, polynomial, or other function of ‘x’. For the moment we will consider only the case of fitting to a line, and we will ignore the issue of whether or not the model is in fact a good fit for the data.

In a standard linear least-squares fit, we have a set of ‘(x,y)’ data points that we wish to fit to the model ‘y = m x + b’ by adjusting the parameters ‘m’ and ‘b’ to make the ‘y’ values calculated from the formula be as close as possible to the actual ‘y’ values in the data set. (In a polynomial fit, the model is instead, say, ‘y = a x^3 + b x^2 + c x + d’. In a multilinear fit, we have data points of the form ‘(x_1,x_2,x_3,y)’ and our model is ‘y = a x_1 + b x_2 + c x_3 + d’. These will be discussed later.)

In the model formula, variables like ‘x’ and ‘x_2’ are called the independent variables, and ‘y’ is the dependent variable. Variables like ‘m’, ‘a’, and ‘b’ are called the parameters of the model.

The a F command takes the data set to be fitted from the stack. By default, it expects the data in the form of a matrix. For example, for a linear or polynomial fit, this would be a 2xN matrix where the first row is a list of ‘x’ values and the second row has the corresponding ‘y’ values. For the multilinear fit shown above, the matrix would have four rows (‘x_1’, ‘x_2’, ‘x_3’, and ‘y’, respectively).

If you happen to have an Nx2 matrix instead of a 2xN matrix, just press v t first to transpose the matrix.
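
For readers who prepare their data outside Calc, the same row/column swap can be sketched in Python. This only illustrates the layout change, not how v t works internally; the names points and matrix are made up for the example.

```python
# Data as an Nx2 list of (x, y) pairs, one pair per row.
points = [[1, 5], [2, 7], [3, 9], [4, 11], [5, 13]]

# zip(*points) regroups the rows column by column, giving a 2xN layout:
# the first row holds all the x values, the second row all the y values.
matrix = [list(row) for row in zip(*points)]
print(matrix)  # prints [[1, 2, 3, 4, 5], [5, 7, 9, 11, 13]]
```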

After you type a F, Calc prompts you to select a model. For a linear fit, press the digit 1.

Calc then prompts you to name the variables. By default it chooses letters from the end of the alphabet like ‘x’ and ‘y’ for independent variables and letters from the start of the alphabet like ‘a’ and ‘b’ for parameters. (The dependent variable doesn’t need a name.) The two kinds of variables are separated by a semicolon. Since you generally care more about the names of the independent variables than of the parameters, Calc also allows you to name only those and let the parameters use default names.

For example, suppose the data matrix

[ [ 1, 2, 3, 4,  5  ]
  [ 5, 7, 9, 11, 13 ] ]

is on the stack and we wish to do a simple linear fit. Type a F, then 1 for the model, then RET to use the default names. The result will be the formula ‘3. + 2. x’ on the stack. Calc has created the model expression a + b x, then found the optimal values of ‘a’ and ‘b’ to fit the data. (In this case, it was able to find an exact fit.) Calc then substituted those values for ‘a’ and ‘b’ in the model formula.
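
As a cross-check outside Calc, the same two parameters can be computed with the textbook closed-form formulas for a straight-line least-squares fit. The following Python is a minimal sketch under that assumption; it is not a description of how Calc computes the fit, and linear_fit is a hypothetical helper name.

```python
def linear_fit(xs, ys):
    """Return (a, b) minimizing the squared error of the model y = a + b x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Same layout as the data matrix above: row 0 is x, row 1 is y.
data = [[1, 2, 3, 4, 5],
        [5, 7, 9, 11, 13]]
a, b = linear_fit(data[0], data[1])
print(f"{a:g} + {b:g} x")  # prints 3 + 2 x
```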

The a F command puts two entries in the trail. One is, as always, a copy of the result that went to the stack; the other is a vector of the actual parameter values, written as equations: ‘[a = 3, b = 2]’, in case you’d rather read them in a list than pick them out of the formula. (You can type t y to move this vector to the stack; see Trail Commands.)

Specifying a different independent variable name will affect the resulting formula: a F 1 k RET produces 3 + 2 k. Changing the parameter names (say, a F 1 k;b,m RET) will affect the equations that go into the trail.

To see what happens when the fit is not exact, we could change the number 13 in the data matrix to 14 and try the fit again. The result is:

2.6 + 2.2 x

Evaluating this formula, say with v x 5 RET TAB V M $ RET, shows a reasonably close match to the y-values in the data.

[4.8, 7., 9.2, 11.4, 13.6]
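
The same evaluation can be sketched in Python: build the index vector 1 through 5 and map the fitted formula over it. The names model, predicted, and residuals are chosen for this illustration only, and the values are rounded to hide floating-point noise.

```python
def model(x):
    """The fitted line from the inexact example: 2.6 + 2.2 x."""
    return 2.6 + 2.2 * x

xs = [1, 2, 3, 4, 5]          # index vector, like v x 5
ys = [5, 7, 9, 11, 14]        # observed y values (13 changed to 14)

# Map the formula over the x values, like V M with the formula.
predicted = [round(model(x), 10) for x in xs]
# Differences between the observed and the fitted values.
residuals = [round(y - p, 10) for y, p in zip(ys, predicted)]
print(predicted)
print(residuals)
```

The residuals are small but not all zero, which is what "reasonably close" means here.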

Since no line passes through all N data points, Calc has chosen a line that best approximates the data points using the method of least squares. The idea is to define the chi-square error measure

chi^2 = sum((y_i - (a + b x_i))^2, i, 1, N)

which is clearly zero if ‘a + b x’ exactly fits all data points, and increases as various ‘a + b x_i’ values fail to match the corresponding ‘y_i’ values. There are several reasons why the summand is squared, one of them being to ensure that ‘chi^2 >= 0’. Least-squares fitting simply chooses the values of ‘a’ and ‘b’ for which the error ‘chi^2’ is as small as possible.
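
To make the minimization concrete, the Python sketch below evaluates the chi-square sum for the inexact example (y values ending in 14) at the fitted parameters 2.6 and 2.2, and at a few nearby parameter values. The function name chi_square and the step size 0.1 are assumptions made for this illustration.

```python
def chi_square(a, b, xs, ys):
    """Sum of squared differences between y_i and the line a + b x_i."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 14]

# Error at the parameters reported by the fit.
best = chi_square(2.6, 2.2, xs, ys)

# Error at the eight neighbouring (a, b) pairs one step of 0.1 away.
nearby = [chi_square(2.6 + da, 2.2 + db, xs, ys)
          for da in (-0.1, 0.0, 0.1)
          for db in (-0.1, 0.0, 0.1)
          if (da, db) != (0.0, 0.0)]

print(round(best, 6))
print(all(c > best for c in nearby))
```

Every neighbouring pair gives a larger error than the fitted pair, which is consistent with the fitted values being the minimum.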

Other kinds of models do the same thing but with a different model formula in place of ‘a + b x_i’.

A numeric prefix argument causes the a F command to take the data in some form other than one big matrix. A positive argument n will take n items from the stack, corresponding to the n rows of a data matrix. In the linear case, n must be 2 since there is always one independent variable and one dependent variable.

A prefix of zero or plain C-u is a compromise; Calc takes two items from the stack, an n-row matrix of ‘x’ values, and a vector of ‘y’ values. If there is only one independent variable, the ‘x’ values can be either a one-row matrix or a plain vector, in which case the C-u prefix is the same as a C-u 2 prefix.
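
The relationship between these input forms and the default matrix can be sketched in Python. The helper as_data_matrix is hypothetical and only models the data layout (x rows first, y as the last row); it says nothing about how Calc reads its stack.

```python
def as_data_matrix(x_part, ys):
    """Combine x values and a y vector into one matrix with y as the last row.

    x_part may be a plain vector (one independent variable) or a matrix
    with one row per independent variable.
    """
    # A plain vector has numbers as elements; a matrix has rows (lists).
    if x_part and not isinstance(x_part[0], (list, tuple)):
        x_rows = [list(x_part)]
    else:
        x_rows = [list(row) for row in x_part]
    return x_rows + [list(ys)]

ys = [5, 7, 9, 11, 14]
from_vector = as_data_matrix([1, 2, 3, 4, 5], ys)    # plain x vector
from_matrix = as_data_matrix([[1, 2, 3, 4, 5]], ys)  # one-row x matrix
print(from_vector == from_matrix)  # prints True
```

With three independent variables, x_part would be a three-row matrix and the result would have four rows, matching the multilinear layout described earlier.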