Regression (general linear and nonlinear least squares fit)
MuPAD® notebooks are not recommended. Use MATLAB® live scripts instead.
MATLAB live scripts support most MuPAD functionality, though there are some differences. For more information, see Convert MuPAD Notebooks to MATLAB Live Scripts.
stats::reg([x_{1, 1}, …,x_{k, 1}], …,[x_{1, m}, …,x_{k, m}]
,[y_{1}, …,y_{k}]
, <[w_{1}, …,w_{k}]
>,f
,[x_{1}, x_{m}]
,[p_{1}, …,p_{n}]
, <StartingValues = [p_{1}(0), …,p_{n}(0)]
>, <CovarianceMatrix>) stats::reg([[x_{1, 1}, …,x_{1, m}, y_{1}, <w_{1}>], …,[x_{k, 1}, …,x_{k, m}, y_{k}, <w_{k}>]]
,f
,[x_{1}, …,x_{m}]
,[p_{1}, …,p_{m}]
, <StartingValues = [p_{1}(0), …,p_{n}(0)]
>, <CovarianceMatrix>) stats::reg(s
,c_{1}, …,c_{m}
,cy
, <cw
>,f
,[x_{1}, …,x_{m}]
,[p_{1}, …,p_{n}]
, <StartingValues = [p_{1}(0), …,p_{n}(0)]
>, <CovarianceMatrix>) stats::reg(s
,[c_{1}, c_{m}]
,cy
, <cw
>,f
,[x_{1}, …,x_{m}]
,[p_{1}, …,p_{n}]
, <StartingValues = [p_{1}(0), …,p_{n}(0)]
>, <CovarianceMatrix>)
Consider a "model function" f with n parameters p_{1}, …, p_{n} relating a dependent variable y and m independent variables x_{1}, …, x_{m}: y = f(x_{1}, …, x_{m}, p_{1}, …, p_{n}). Given k different measurements x_{1 j}, …, x_{kj} for the independent variables x_{j} and corresponding measurements y_{1}, …, y_{k} for the dependent variable y, one fits the parameters p_{1}, …, p_{n} by minimizing the "weighted quadratic deviation" ("chisquared")
.
stats::reg(..data.., f, [x.1, ... , x.m], [p.1, ...
, p.n], [w.1, ..., w.n])
computes numerical approximations
of the fit parameters p_{1},
…, p_{n}.
All data must be convertible to real or complex floatingpoint
values via float
.
The number of measurements k must not be less than the number n of parameters p_{i}.
The model function f may
be nonlinear in the independent variables x_{i} and
the fit parameters p_{i}.
E.g., a model function such as p1 + p2*x1^2 + exp(p3 + p4*x2)
with
the independent variables x1, x2
and the fit parameters p1,
p2, p3, p4
is accepted.
Note that the fitting of model functions with a nonlinear dependence on the parameters p_{i} is much more costly than a linear regression, where the p_{i} enter linearly. The functional dependence of the model function on the variables x_{i} is of no relevance.
Note: There are rare cases where the implemented algorithm converges to a local minimum rather than to a global minimum. In particular, this problem may arise when the model involves periodic functions. It is recommended to provide suitable starting values for the fit parameters in this case. Cf. Example 4. 
External statistical data stored in an ASCII file can be imported
into a MuPAD^{®} session via import::readdata
. In particular, see
Example 1 of the corresponding help page.
The function is sensitive to the environment variable DIGITS
,
which determines the numerical working precision.
We fit a linear function y = p_{1} + p_{2} x_{1} to four data pairs (x_{i1}, y_{i}) given by two lists:
stats::reg([0, 1, 2, 3], [1, 3, 5, 7], p1 + p2*x1, [x1], [p1, p2])
The parameter values p_{1} = 1.0, p_{2} = 2.0 provide a perfect fit: up to numerical roundoff, the quadratic deviation vanishes.
We fit an exponential function y = a e^{b x} to five data pairs (x_{i}, y_{i}). Weights are used to decrease the influence of the "exceptional pair" (x, y) = (5.0, 6.5×10^{6}) on the fit:
stats::reg([[1.1, 54, 1], [1.2, 73, 1], [1.3, 98, 1], [1.4, 133, 1], [5.0, 6.5*10^6, 10^(4)]], a*exp(b*x), [x], [a, b])
We create a sample with four columns. The first column is a counter labeling the measurements. This column is of no further relevance here. The second and third column provide measured data of two variables x_{1} and x_{2}, respectively. The last column provides corresponding measurements of a dependent variable.
s := stats::sample([[1, 0, 0, 1.1], [2, 0, 1, 5.4], [3, 1, 1, 8.5], [4, 1, 2, 18.5], [5, 2, 1, 15.0], [6, 2, 2, 24.8]])
1 0 0 1.1 2 0 1 5.4 3 1 1 8.5 4 1 2 18.5 5 2 1 15.0 6 2 2 24.8
First, we try to model the data provided by the columns 2, 3, 4 by a function that is linear in the variables x_{1}, x_{2}. We specify the data columns by a list of column indices:
stats::reg(s, [2, 3, 4], p1 + p2*x1 + p3*x2, [x1, x2], [p1, p2, p3])
The quadratic deviation is rather large, indicating that a linear function is inappropriate to fit the data. Next, we extend the model and consider a polynomial fit function of degree 2. This is still a linear regression problem, because the fit parameters enter the model function linearly. We specify the data columns by a sequence of column indices:
stats::reg(s, 2, 3, 4, p1 + p2*x1 + p3*x2 + p4*x1^2 + p5*x2^2, [x1, x2], [p1, p2, p3, p4, p5])
Finally, we include a further term p6*x1*x2
in
the model, obtaining a perfect fit:
stats::reg(s, 2, 3, 4, p1 + p2*x1 + p3*x2 + p4*x1^2 + p5*x2^2 + p6*x1*x2, [x1, x2], [p1, p2, p3, p4, p5, p6])
delete s:
We create a sample of two columns:
s := stats::sample([[1, 1.44], [2, 0.82], [3, 0.97], [4, 1.37]])
1 1.44 2 0.82 3 0.97 4 1.37
The data are to be modeled by a function of the form y = p_{1} sin(p_{2} x), where the first column contains measurements of x and the second column contains corresponding data for y. Note that in this example there is no need to specify column indices, because the sample contains only two columns:
stats::reg(s, a*sin(b*x), [x], [a, b])
Fitting a periodic function may be problematic. We provide starting values for the fit parameters and obtain a quite different set of parameters approximating the data with the same quality:
stats::reg(s, a*sin(b*x), [x], [a, b], StartingValues = [2, 5])
delete s:
The blood sugar level y
(in mmol/L) of a
diabetic is measured over a period of 10 days with 5 measurements
per day at x1
= 7 (o'clock a.m.), x1
=
12 (noon), x1
= 15 (afternoon), x1
=
19 (before dinner), and x1
= 23 (bed time). These
are the measurements:
Y:= //hour: 7 12 15 19 23 [ [ 7.2, 5.5, 6.8, 5.4, 6.0], // day 1 [ 6.3, 5.0, 5.5, 5.8, 4.9], // day 2 [ 6.5, 6.3, 4.8, 4.5, 5.0], // day 3 [ 4.3, 5.2, 4.3, 4.7, 4.0], // day 4 [ 7.1, 7.2, 6.7, 7.2, 5.5], // day 5 [ 5.8, 5.5, 4.9, 5.0, 6.2], // day 6 [ 6.2, 4.8, 5.0, 5.2, 5.3], // day 7 [ 4.8, 5.8, 5.7, 6.2, 5.0], // day 8 [ 5.2, 3.8, 4.8, 5.8, 4.7], // day 9 [ 5.8, 4.7, 5.0, 6.5, 6.3] // day 10 ]:
We have a total of 50 measurements. Each measurement is a triple [x1,
x2, y]
, where x1 is
the hour of the day, x2 is
the day number, and y
is the blood sugar level:
data:= [([ 7, x2, Y[x2][1]], [12, x2, Y[x2][2]], [15, x2, Y[x2][3]], [19, x2, Y[x2][4]], [23, x2, Y[x2][5]] ) $ x2 = 1 .. 10]:
We model the blood sugar y
as a function
of the hour of the day x1
and the day number x2
(trying
to detect a general tendency). We assume a periodic dependence on x1
with
a period of 24 hours:
y := y0 + a*x2 + b*sin(2*PI/24*x1 + c):
A least squares fit of the given data leads to the following
parameters y0, a, b, c
:
[y0abc, residue]:= stats::reg(data, y, [x1, x2], [y0, a, b, c]): [y0, a, b, c]:= y0abc
The average blood sugar level is y0
=
with
an improvement of a
=
per
day. The amplitude of the daily variation of y
is b
=
.
We visualize the measurements Y
by a plot::Matrixplot
. The
least squares fit of our model function y
is added
as a function graph:
plot(plot::Matrixplot(Y, x1 = 0..24, x2 = 1..10), plot::Function3d(y, x1 = 0..24, x2 = 1..10, Color = RGB::Green))
delete Y, data, y, y0abc, y0, a, b, c, residue:
We consider a decaying radioactive source, whose activity N ("counts") is measured at intervals of 1 second. The physical model for the decay is , where N(t) is the count rate at time t, N_{0} is the base rate at time t = 0 and τ is the lifetime of the radioactive source. Instead of taking data from an actual physical experiment, we create artificial data with a base rate N_{0} = 100 and a lifetime τ = 300:
T := [i $ i= 0 .. 100]: N := [100*exp(t/300) $ t in T]:
By construction, we obtain a perfect fit when estimating the paramaters N_{0} and τ of the model:
stats::reg(T, N, N0*exp(t/tau), [t], [N0, tau]);
We perturb the data:
N := [stats::poissonRandom(n)() $ n in N]:
Since the data n_{i} in N are Poissonian, their standard deviation is the square root of their mean: . Thus, suitable weights for a least squares estimation of the parameters are given by :
W := [1/n $ n in N]:
With these weights, a least squares fit of the model parameters N_{0} and τ is
computed. The option CovarianceMatrix
is used to
get information on confidence intervals for the parameters:
[p, chisquared, C] := stats::reg(T, N, W, N0*exp(t/tau), [t], [N0, tau], CovarianceMatrix)
The square roots of the diagonal elements of the covariance matrix provides the statistical standard deviations of the fit parameters:
sqrt(float(C[1,1])), sqrt(float(C[2,2]))
Thus, the estimate for the base rate N_{0} is
,
the estimate for the lifetime τ is
.
The correlation matrix of the fit parameters is obtained from the
covariance matrix via stats::correlationMatrix
:
stats::correlationMatrix(C)
delete T, N, W, p, chisquared, C:

Numerical sample data for the independent variables. The entry x_{i, j} represents the ith measurement of the independent variable x_{j}. 

Numerical sample data for the dependent variable. The entry y_{i} represents the ith measurement of the dependent variable. 

Weight factors: positive real numerical values. The entry w_{i} is used as a weight for the data x_{i, 1}, …, x_{i, m}, y_{i} of the ith measurement. If no weights are provided, then w_{i} = 1 is used. 

The model function: an arithmetical expression representing a function of the independent variables x_{1}, …, x_{m} and the fit parameters p_{1}, …, p_{n}. The expression must not contain any symbolic objects apart from x_{1}, …, x_{m}, p_{1}, …, p_{n}. 

The independent variables: identifiers or indexed identifiers. 

The fit parameters: identifiers or indexed identifiers. 

The user can assist the internal numerical search by providing numerical starting values p_{i}(0) for the fit parameters p_{i}. These should be reasonably close to the optimal fit values. The starting values p_{i}(0) = 1.0 are used if no starting values are provided by the user. 

A sample of domain type 

A positive integer representing a column index of the sample 

A positive integer representing a column index of the sample 

Option, specified as Positive integers representing column indices of the sample If the model function depends linearly on the fit parameters p_{j} ("linear regression"), then the optimized parameters are the solution of a linear system of equations. In this case there is no need to provide starting values for a numerical search. In fact, initial values provided by the user are ignored. If the model function depends nonlinearly on the fit parameters p_{j} ("nonlinear regression"), then the optimized fitting parameters are the solution of a nonlinear optimization problem. There is no guarantee that the internal search for a numerical solution will succeed. It is recommended to assist the internal solver by providing reasonably good estimates for the optimal fit parameters. 

Changes the return value from [[p_{1}, …, p_{n}], χ^{2}] to [[p_{1}, …, p_{n}], χ^{2}, C], where C is the covariance matrix of the estimators p_{i} given by C_{i, i} = σ(p_{i})^{2} and C_{i, j} = cov(p_{i}, p_{j}) for i ≠ j. With this option, information on confidence intervals for the
least squares estimators p_{i} are
provided. In particular, the return value includes the covariance
matrix C of
type , Where . The covariance matrix of the least squares estimators only has a statistical meaning if the stochastic variances σ(y_{i})^{2} of the measurements y_{i} are known. These variances are to be included in the computation by choosing the weights . Cf. Example 6. The function 
Without the option CovarianceMatrix
, a list [[p_{1},
…, p_{n}], χ^{2}] is
returned. It contains the optimized fit parameters p_{i} minimizing
the quadratic deviation. The minimized value of this deviation is
given by χ^{2},
it indicates the quality of the fit.
With the option CovarianceMatrix
, a list [[p_{1},
…, p_{n}], χ^{2}, C] is
returned. The n×n matrix C
is the covariance
matrix of the fit parameters.
All returned data are floatingpoint values. FAIL
is
returned if a least square fit of the data is not possible with the
given model function or if the internal numerical search failed.
P.R. Bevington and D.K. Robinson, "Data Reduction and Error Analysis for The Physical Sciences", McGrawHill, New York, 1992.
stats::reg
uses a MarquardtLevenberg gradient
expansion algorithm. Searching for the minimum of
,
the algorithm does not simply follow the negative gradient, but the
diagonal terms of the curvature matrix are increased by a factor that
is optimized in each step of the search.