stats::reg

Regression (general linear and nonlinear least squares fit)

Use only in the MuPAD Notebook Interface.

This functionality does not run in MATLAB.

Syntax

stats::reg([x1, 1, …,xk, 1], …,[x1, m, …,xk, m], [y1, …,yk], <[w1, …,wk]>, f, [x1, xm], [p1, …,pn], <StartingValues = [p1(0), …,pn(0)]>, <CovarianceMatrix>)
stats::reg([[x1, 1, …,x1, m, y1, <w1>], …,[xk, 1, …,xk, m, yk, <wk>]], f, [x1, …,xm], [p1, …,pm], <StartingValues = [p1(0), …,pn(0)]>, <CovarianceMatrix>)
stats::reg(s, c1, …,cm, cy, <cw>, f, [x1, …,xm], [p1, …,pn], <StartingValues = [p1(0), …,pn(0)]>, <CovarianceMatrix>)
stats::reg(s, [c1, cm], cy, <cw>, f, [x1, …,xm], [p1, …,pn], <StartingValues = [p1(0), …,pn(0)]>, <CovarianceMatrix>)

Description

Consider a "model function" f with n parameters p1, …, pn relating a dependent variable y and m independent variables x1, …, xm: y = f(x1, …, xm, p1, …, pn). Given k different measurements x1 j, …, xkj for the independent variables xj and corresponding measurements y1, …, yk for the dependent variable y, one fits the parameters p1, …, pn by minimizing the "weighted quadratic deviation" ("chi-squared")

.

stats::reg(..data.., f, [x.1, ... , x.m], [p.1, ... , p.n], [w.1, ..., w.n]) computes numerical approximations of the fit parameters p1, …, pn.

All data must be convertible to real or complex floating-point values via float.

The number of measurements k must not be less than the number n of parameters pi.

The model function f may be non-linear in the independent variables xi and the fit parameters pi. E.g., a model function such as p1 + p2*x1^2 + exp(p3 + p4*x2) with the independent variables x1, x2 and the fit parameters p1, p2, p3, p4 is accepted.

Note that the fitting of model functions with a non-linear dependence on the parameters pi is much more costly than a linear regression, where the pi enter linearly. The functional dependence of the model function on the variables xi is of no relevance.

    Note:   There are rare cases where the implemented algorithm converges to a local minimum rather than to a global minimum. In particular, this problem may arise when the model involves periodic functions. It is recommended to provide suitable starting values for the fit parameters in this case. Cf. Example 4.

External statistical data stored in an ASCII file can be imported into a MuPAD® session via import::readdata. In particular, see Example 1 of the corresponding help page.

Environment Interactions

The function is sensitive to the environment variable DIGITS, which determines the numerical working precision.

Examples

Example 1

We fit a linear function y = p1 + p2x1 to four data pairs (xi1, yi) given by two lists:

stats::reg([0, 1, 2, 3], [1, 3, 5, 7],
           p1 + p2*x1, [x1], [p1, p2])

The parameter values p1 = 1.0, p2 = 2.0 provide a perfect fit: up to numerical round-off, the quadratic deviation vanishes.

Example 2

We fit an exponential function y = aebx to five data pairs (xi, yi). Weights are used to decrease the influence of the "exceptional pair" (x, y) = (5.0, 6.5×106) on the fit:

stats::reg([[1.1, 54, 1], [1.2, 73, 1], [1.3, 98, 1], 
            [1.4, 133, 1], [5.0, 6.5*10^6, 10^(-4)]], 
           a*exp(b*x), [x], [a, b])

Example 3

We create a sample with four columns. The first column is a counter labeling the measurements. This column is of no further relevance here. The second and third column provide measured data of two variables x1 and x2, respectively. The last column provides corresponding measurements of a dependent variable.

s := stats::sample([[1, 0, 0,  1.1], [2, 0, 1,  5.4], 
                    [3, 1, 1,  8.5], [4, 1, 2, 18.5], 
                    [5, 2, 1, 15.0], [6, 2, 2, 24.8]])
1  0  0   1.1
2  0  1   5.4
3  1  1   8.5
4  1  2  18.5
5  2  1  15.0
6  2  2  24.8

First, we try to model the data provided by the columns 2, 3, 4 by a function that is linear in the variables x1, x2. We specify the data columns by a list of column indices:

stats::reg(s, [2, 3, 4], p1 + p2*x1 + p3*x2, 
           [x1, x2], [p1, p2, p3])

The quadratic deviation is rather large, indicating that a linear function is inappropriate to fit the data. Next, we extend the model and consider a polynomial fit function of degree 2. This is still a linear regression problem, because the fit parameters enter the model function linearly. We specify the data columns by a sequence of column indices:

stats::reg(s, 2, 3, 4,
           p1 + p2*x1 + p3*x2 + p4*x1^2 + p5*x2^2, 
           [x1, x2], [p1, p2, p3, p4, p5])

Finally, we include a further term p6*x1*x2 in the model, obtaining a perfect fit:

stats::reg(s, 2, 3, 4, 
           p1 + p2*x1 + p3*x2 + p4*x1^2 + p5*x2^2 + p6*x1*x2, 
           [x1, x2], [p1, p2, p3, p4, p5, p6])

delete s:

Example 4

We create a sample of two columns:

s := stats::sample([[1, -1.44], [2, -0.82], 
                    [3, 0.97], [4, 1.37]])
1  -1.44
2  -0.82
3   0.97
4   1.37

The data are to be modeled by a function of the form y = p1sin(p2x), where the first column contains measurements of x and the second column contains corresponding data for y. Note that in this example there is no need to specify column indices, because the sample contains only two columns:

stats::reg(s, a*sin(b*x), [x], [a, b])

Fitting a periodic function may be problematic. We provide starting values for the fit parameters and obtain a quite different set of parameters approximating the data with the same quality:

stats::reg(s, a*sin(b*x), [x], [a, b], StartingValues = [2, 5])

delete s:

Example 5

The blood sugar level y (in mmol/L) of a diabetic is measured over a period of 10 days with 5 measurements per day at x1 = 7 (o'clock a.m.), x1 = 12 (noon), x1 = 15 (afternoon), x1 = 19 (before dinner), and x1 = 23 (bed time). These are the measurements:

Y:=  //hour: 7    12   15   19   23
    [     [ 7.2, 5.5, 6.8, 5.4, 6.0], // day 1
          [ 6.3, 5.0, 5.5, 5.8, 4.9], // day 2
          [ 6.5, 6.3, 4.8, 4.5, 5.0], // day 3
          [ 4.3, 5.2, 4.3, 4.7, 4.0], // day 4
          [ 7.1, 7.2, 6.7, 7.2, 5.5], // day 5
          [ 5.8, 5.5, 4.9, 5.0, 6.2], // day 6
          [ 6.2, 4.8, 5.0, 5.2, 5.3], // day 7
          [ 4.8, 5.8, 5.7, 6.2, 5.0], // day 8
          [ 5.2, 3.8, 4.8, 5.8, 4.7], // day 9
          [ 5.8, 4.7, 5.0, 6.5, 6.3]  // day 10
    ]:

We have a total of 50 measurements. Each measurement is a triple [x1, x2, y], where x1 is the hour of the day, x2 is the day number, and y is the blood sugar level:

data:= [([ 7, x2, Y[x2][1]], 
         [12, x2, Y[x2][2]], 
         [15, x2, Y[x2][3]], 
         [19, x2, Y[x2][4]], 
         [23, x2, Y[x2][5]]
        ) $ x2 = 1 .. 10]:

We model the blood sugar y as a function of the hour of the day x1 and the day number x2 (trying to detect a general tendency). We assume a periodic dependence on x1 with a period of 24 hours:

y := y0 + a*x2 + b*sin(2*PI/24*x1 + c):

A least squares fit of the given data leads to the following parameters y0, a, b, c:

[y0abc, residue]:= stats::reg(data, y, [x1, x2], [y0, a, b, c]):
[y0, a, b, c]:= y0abc

The average blood sugar level is y0 = with an improvement of a = per day. The amplitude of the daily variation of y is b = . We visualize the measurements Y by a plot::Matrixplot. The least squares fit of our model function y is added as a function graph:

plot(plot::Matrixplot(Y, x1 = 0..24, x2 = 1..10),
     plot::Function3d(y, x1 = 0..24, x2 = 1..10,
                      Color = RGB::Green))

delete Y, data, y, y0abc, y0, a, b, c, residue:

Example 6

We consider a decaying radioactive source, whose activity N ("counts") is measured at intervals of 1 second. The physical model for the decay is , where N(t) is the count rate at time t, N0 is the base rate at time t = 0 and τ is the lifetime of the radioactive source. Instead of taking data from an actual physical experiment, we create artificial data with a base rate N0 = 100 and a lifetime τ = 300:

T := [i $ i= 0 .. 100]:
N := [100*exp(-t/300) $ t in T]:

By construction, we obtain a perfect fit when estimating the paramaters N0 and τ of the model:

stats::reg(T, N, N0*exp(-t/tau), [t], [N0, tau]);

We perturb the data:

N := [stats::poissonRandom(n)() $ n in N]:

Since the data ni in N are Poissonian, their standard deviation is the square root of their mean: . Thus, suitable weights for a least squares estimation of the parameters are given by :

W := [1/n $ n in N]:

With these weights, a least squares fit of the model parameters N0 and τ is computed. The option CovarianceMatrix is used to get information on confidence intervals for the parameters:

[p, chisquared, C] :=
    stats::reg(T, N, W, N0*exp(-t/tau), [t], [N0, tau],
               CovarianceMatrix)

The square roots of the diagonal elements of the covariance matrix provides the statistical standard deviations of the fit parameters:

sqrt(float(C[1,1])), sqrt(float(C[2,2]))

Thus, the estimate for the base rate N0 is , the estimate for the lifetime τ is . The correlation matrix of the fit parameters is obtained from the covariance matrix via stats::correlationMatrix:

stats::correlationMatrix(C)

delete T, N, W, p, chisquared, C:

Parameters

x1, 1, …,xk, m

Numerical sample data for the independent variables. The entry xi, j represents the i-th measurement of the independent variable xj.

y1, …,yk

Numerical sample data for the dependent variable. The entry yi represents the i-th measurement of the dependent variable.

w1, …,wk

Weight factors: positive real numerical values. The entry wi is used as a weight for the data xi, 1, …, xi, m, yi of the i-th measurement. If no weights are provided, then wi = 1 is used.

f

The model function: an arithmetical expression representing a function of the independent variables x1, …, xm and the fit parameters p1, …, pn. The expression must not contain any symbolic objects apart from x1, …, xm, p1, …, pn.

x1, …,xm

The independent variables: identifiers or indexed identifiers.

p1, …,pn

The fit parameters: identifiers or indexed identifiers.

p1(0), …,pn(0)

The user can assist the internal numerical search by providing numerical starting values pi(0) for the fit parameters pi. These should be reasonably close to the optimal fit values. The starting values pi(0) = 1.0 are used if no starting values are provided by the user.

s

A sample of domain type stats::sample containing the data xi, j for the independent variables, the data yi for the dependent variable and, optionally, the weights wi.

cy

A positive integer representing a column index of the sample s. This column provides the measurements yi for the dependent variable.

cw

A positive integer representing a column index of the sample s. This column provides the weight factors wi.

Options

StartingValues

Option, specified as StartingValues = [p1(0), …,pn(0)]

Positive integers representing column indices of the sample s. Column pj provides the measurements xi, j for the independent variable xj.

If the model function depends linearly on the fit parameters pj ("linear regression"), then the optimized parameters are the solution of a linear system of equations. In this case there is no need to provide starting values for a numerical search. In fact, initial values provided by the user are ignored.

If the model function depends non-linearly on the fit parameters pj ("non-linear regression"), then the optimized fitting parameters are the solution of a non-linear optimization problem. There is no guarantee that the internal search for a numerical solution will succeed. It is recommended to assist the internal solver by providing reasonably good estimates for the optimal fit parameters.

CovarianceMatrix

Changes the return value from [[p1, …, pn], χ2] to [[p1, …, pn], χ2, C], where C is the covariance matrix of the estimators pi given by Ci, i = σ(pi)2 and Ci, j = cov(pi, pj) for ij.

With this option, information on confidence intervals for the least squares estimators pi are provided. In particular, the return value includes the covariance matrix C of type Dom::Matrix(). This matrix provides the variances Cii = σ(pi)2 of the least squares estimators pi and their covariances Cij = cov(pi, pj). The covariance matrix is defined via its inverse

,

Where

.

The covariance matrix of the least squares estimators only has a statistical meaning if the stochastic variances σ(yi)2 of the measurements yi are known. These variances are to be included in the computation by choosing the weights . Cf. Example 6.

The function stats::correlationMatrix serves for converting the covariance matrix to the corresponding correlation matrix. See Example 6

Return Values

Without the option CovarianceMatrix, a list [[p1, …, pn], χ2] is returned. It contains the optimized fit parameters pi minimizing the quadratic deviation. The minimized value of this deviation is given by χ2, it indicates the quality of the fit.

With the option CovarianceMatrix, a list [[p1, …, pn], χ2, C] is returned. The n×n matrix C is the covariance matrix of the fit parameters.

All returned data are floating-point values. FAIL is returned if a least square fit of the data is not possible with the given model function or if the internal numerical search failed.

Algorithms

stats::reg uses a Marquardt-Levenberg gradient expansion algorithm. Searching for the minimum of , the algorithm does not simply follow the negative gradient, but the diagonal terms of the curvature matrix are increased by a factor that is optimized in each step of the search.

References

P.R. Bevington and D.K. Robinson, "Data Reduction and Error Analysis for The Physical Sciences", McGraw-Hill, New York, 1992.

Was this topic helpful?