Documentation |
Regression (general linear and nonlinear least squares fit)
This functionality does not run in MATLAB.
stats::reg([x_{1, 1}, …,x_{k, 1}], …,[x_{1, m}, …,x_{k, m}], [y_{1}, …,y_{k}], <[w_{1}, …,w_{k}]>, f, [x_{1}, x_{m}], [p_{1}, …,p_{n}], <StartingValues = [p_{1}(0), …,p_{n}(0)]>, <CovarianceMatrix>) stats::reg([[x_{1, 1}, …,x_{1, m}, y_{1}, <w_{1}>], …,[x_{k, 1}, …,x_{k, m}, y_{k}, <w_{k}>]], f, [x_{1}, …,x_{m}], [p_{1}, …,p_{m}], <StartingValues = [p_{1}(0), …,p_{n}(0)]>, <CovarianceMatrix>) stats::reg(s, c_{1}, …,c_{m}, cy, <cw>, f, [x_{1}, …,x_{m}], [p_{1}, …,p_{n}], <StartingValues = [p_{1}(0), …,p_{n}(0)]>, <CovarianceMatrix>) stats::reg(s, [c_{1}, c_{m}], cy, <cw>, f, [x_{1}, …,x_{m}], [p_{1}, …,p_{n}], <StartingValues = [p_{1}(0), …,p_{n}(0)]>, <CovarianceMatrix>)
Consider a "model function" f with n parameters p_{1}, …, p_{n} relating a dependent variable y and m independent variables x_{1}, …, x_{m}: y = f(x_{1}, …, x_{m}, p_{1}, …, p_{n}). Given k different measurements x_{1 j}, …, x_{kj} for the independent variables x_{j} and corresponding measurements y_{1}, …, y_{k} for the dependent variable y, one fits the parameters p_{1}, …, p_{n} by minimizing the "weighted quadratic deviation" ("chi-squared")
.
stats::reg(..data.., f, [x.1, ... , x.m], [p.1, ... , p.n], [w.1, ..., w.n]) computes numerical approximations of the fit parameters p_{1}, …, p_{n}.
All data must be convertible to real or complex floating-point values via float.
The number of measurements k must not be less than the number n of parameters p_{i}.
The model function f may be non-linear in the independent variables x_{i} and the fit parameters p_{i}. E.g., a model function such as p1 + p2*x1^2 + exp(p3 + p4*x2) with the independent variables x1, x2 and the fit parameters p1, p2, p3, p4 is accepted.
Note that the fitting of model functions with a non-linear dependence on the parameters p_{i} is much more costly than a linear regression, where the p_{i} enter linearly. The functional dependence of the model function on the variables x_{i} is of no relevance.
Note: There are rare cases where the implemented algorithm converges to a local minimum rather than to a global minimum. In particular, this problem may arise when the model involves periodic functions. It is recommended to provide suitable starting values for the fit parameters in this case. Cf. Example 4. |
External statistical data stored in an ASCII file can be imported into a MuPAD^{®} session via import::readdata. In particular, see Example 1 of the corresponding help page.
The function is sensitive to the environment variable DIGITS, which determines the numerical working precision.
We fit a linear function y = p_{1} + p_{2} x_{1} to four data pairs (x_{i1}, y_{i}) given by two lists:
stats::reg([0, 1, 2, 3], [1, 3, 5, 7], p1 + p2*x1, [x1], [p1, p2])
The parameter values p_{1} = 1.0, p_{2} = 2.0 provide a perfect fit: up to numerical round-off, the quadratic deviation vanishes.
We fit an exponential function y = a e^{b x} to five data pairs (x_{i}, y_{i}). Weights are used to decrease the influence of the "exceptional pair" (x, y) = (5.0, 6.5×10^{6}) on the fit:
stats::reg([[1.1, 54, 1], [1.2, 73, 1], [1.3, 98, 1], [1.4, 133, 1], [5.0, 6.5*10^6, 10^(-4)]], a*exp(b*x), [x], [a, b])
We create a sample with four columns. The first column is a counter labeling the measurements. This column is of no further relevance here. The second and third column provide measured data of two variables x_{1} and x_{2}, respectively. The last column provides corresponding measurements of a dependent variable.
s := stats::sample([[1, 0, 0, 1.1], [2, 0, 1, 5.4], [3, 1, 1, 8.5], [4, 1, 2, 18.5], [5, 2, 1, 15.0], [6, 2, 2, 24.8]])
1 0 0 1.1 2 0 1 5.4 3 1 1 8.5 4 1 2 18.5 5 2 1 15.0 6 2 2 24.8
First, we try to model the data provided by the columns 2, 3, 4 by a function that is linear in the variables x_{1}, x_{2}. We specify the data columns by a list of column indices:
stats::reg(s, [2, 3, 4], p1 + p2*x1 + p3*x2, [x1, x2], [p1, p2, p3])
The quadratic deviation is rather large, indicating that a linear function is inappropriate to fit the data. Next, we extend the model and consider a polynomial fit function of degree 2. This is still a linear regression problem, because the fit parameters enter the model function linearly. We specify the data columns by a sequence of column indices:
stats::reg(s, 2, 3, 4, p1 + p2*x1 + p3*x2 + p4*x1^2 + p5*x2^2, [x1, x2], [p1, p2, p3, p4, p5])
Finally, we include a further term p6*x1*x2 in the model, obtaining a perfect fit:
stats::reg(s, 2, 3, 4, p1 + p2*x1 + p3*x2 + p4*x1^2 + p5*x2^2 + p6*x1*x2, [x1, x2], [p1, p2, p3, p4, p5, p6])
delete s:
We create a sample of two columns:
s := stats::sample([[1, -1.44], [2, -0.82], [3, 0.97], [4, 1.37]])
1 -1.44 2 -0.82 3 0.97 4 1.37
The data are to be modeled by a function of the form y = p_{1} sin(p_{2} x), where the first column contains measurements of x and the second column contains corresponding data for y. Note that in this example there is no need to specify column indices, because the sample contains only two columns:
stats::reg(s, a*sin(b*x), [x], [a, b])
Fitting a periodic function may be problematic. We provide starting values for the fit parameters and obtain a quite different set of parameters approximating the data with the same quality:
stats::reg(s, a*sin(b*x), [x], [a, b], StartingValues = [2, 5])
delete s:
The blood sugar level y (in mmol/L) of a diabetic is measured over a period of 10 days with 5 measurements per day at x1 = 7 (o'clock a.m.), x1 = 12 (noon), x1 = 15 (afternoon), x1 = 19 (before dinner), and x1 = 23 (bed time). These are the measurements:
Y:= //hour: 7 12 15 19 23 [ [ 7.2, 5.5, 6.8, 5.4, 6.0], // day 1 [ 6.3, 5.0, 5.5, 5.8, 4.9], // day 2 [ 6.5, 6.3, 4.8, 4.5, 5.0], // day 3 [ 4.3, 5.2, 4.3, 4.7, 4.0], // day 4 [ 7.1, 7.2, 6.7, 7.2, 5.5], // day 5 [ 5.8, 5.5, 4.9, 5.0, 6.2], // day 6 [ 6.2, 4.8, 5.0, 5.2, 5.3], // day 7 [ 4.8, 5.8, 5.7, 6.2, 5.0], // day 8 [ 5.2, 3.8, 4.8, 5.8, 4.7], // day 9 [ 5.8, 4.7, 5.0, 6.5, 6.3] // day 10 ]:
We have a total of 50 measurements. Each measurement is a triple [x1, x2, y], where x1 is the hour of the day, x2 is the day number, and y is the blood sugar level:
data:= [([ 7, x2, Y[x2][1]], [12, x2, Y[x2][2]], [15, x2, Y[x2][3]], [19, x2, Y[x2][4]], [23, x2, Y[x2][5]] ) $ x2 = 1 .. 10]:
We model the blood sugar y as a function of the hour of the day x1 and the day number x2 (trying to detect a general tendency). We assume a periodic dependence on x1 with a period of 24 hours:
y := y0 + a*x2 + b*sin(2*PI/24*x1 + c):
A least squares fit of the given data leads to the following parameters y0, a, b, c:
[y0abc, residue]:= stats::reg(data, y, [x1, x2], [y0, a, b, c]): [y0, a, b, c]:= y0abc
The average blood sugar level is y0 = with an improvement of a = per day. The amplitude of the daily variation of y is b = . We visualize the measurements Y by a plot::Matrixplot. The least squares fit of our model function y is added as a function graph:
plot(plot::Matrixplot(Y, x1 = 0..24, x2 = 1..10), plot::Function3d(y, x1 = 0..24, x2 = 1..10, Color = RGB::Green))
delete Y, data, y, y0abc, y0, a, b, c, residue:
We consider a decaying radioactive source, whose activity N ("counts") is measured at intervals of 1 second. The physical model for the decay is , where N(t) is the count rate at time t, N_{0} is the base rate at time t = 0 and τ is the lifetime of the radioactive source. Instead of taking data from an actual physical experiment, we create artificial data with a base rate N_{0} = 100 and a lifetime τ = 300:
T := [i $ i= 0 .. 100]: N := [100*exp(-t/300) $ t in T]:
By construction, we obtain a perfect fit when estimating the paramaters N_{0} and τ of the model:
stats::reg(T, N, N0*exp(-t/tau), [t], [N0, tau]);
We perturb the data:
N := [stats::poissonRandom(n)() $ n in N]:
Since the data n_{i} in N are Poissonian, their standard deviation is the square root of their mean: . Thus, suitable weights for a least squares estimation of the parameters are given by :
W := [1/n $ n in N]:
With these weights, a least squares fit of the model parameters N_{0} and τ is computed. The option CovarianceMatrix is used to get information on confidence intervals for the parameters:
[p, chisquared, C] := stats::reg(T, N, W, N0*exp(-t/tau), [t], [N0, tau], CovarianceMatrix)
The square roots of the diagonal elements of the covariance matrix provides the statistical standard deviations of the fit parameters:
sqrt(float(C[1,1])), sqrt(float(C[2,2]))
Thus, the estimate for the base rate N_{0} is , the estimate for the lifetime τ is . The correlation matrix of the fit parameters is obtained from the covariance matrix via stats::correlationMatrix:
stats::correlationMatrix(C)
delete T, N, W, p, chisquared, C:
x_{1, 1}, …,x_{k, m} |
Numerical sample data for the independent variables. The entry x_{i, j} represents the i-th measurement of the independent variable x_{j}. |
y_{1}, …,y_{k} |
Numerical sample data for the dependent variable. The entry y_{i} represents the i-th measurement of the dependent variable. |
w_{1}, …,w_{k} |
Weight factors: positive real numerical values. The entry w_{i} is used as a weight for the data x_{i, 1}, …, x_{i, m}, y_{i} of the i-th measurement. If no weights are provided, then w_{i} = 1 is used. |
f |
The model function: an arithmetical expression representing a function of the independent variables x_{1}, …, x_{m} and the fit parameters p_{1}, …, p_{n}. The expression must not contain any symbolic objects apart from x_{1}, …, x_{m}, p_{1}, …, p_{n}. |
x_{1}, …,x_{m} |
The independent variables: identifiers or indexed identifiers. |
p_{1}, …,p_{n} |
The fit parameters: identifiers or indexed identifiers. |
p_{1}(0), …,p_{n}(0) |
The user can assist the internal numerical search by providing numerical starting values p_{i}(0) for the fit parameters p_{i}. These should be reasonably close to the optimal fit values. The starting values p_{i}(0) = 1.0 are used if no starting values are provided by the user. |
s |
A sample of domain type stats::sample containing the data x_{i, j} for the independent variables, the data y_{i} for the dependent variable and, optionally, the weights w_{i}. |
cy |
A positive integer representing a column index of the sample s. This column provides the measurements y_{i} for the dependent variable. |
cw |
A positive integer representing a column index of the sample s. This column provides the weight factors w_{i}. |
StartingValues |
Option, specified as StartingValues = [p1(0), …,pn(0)] Positive integers representing column indices of the sample s. Column p_{j} provides the measurements x_{i, j} for the independent variable x_{j}. If the model function depends linearly on the fit parameters p_{j} ("linear regression"), then the optimized parameters are the solution of a linear system of equations. In this case there is no need to provide starting values for a numerical search. In fact, initial values provided by the user are ignored. If the model function depends non-linearly on the fit parameters p_{j} ("non-linear regression"), then the optimized fitting parameters are the solution of a non-linear optimization problem. There is no guarantee that the internal search for a numerical solution will succeed. It is recommended to assist the internal solver by providing reasonably good estimates for the optimal fit parameters. |
CovarianceMatrix |
Changes the return value from [[p_{1}, …, p_{n}], χ^{2}] to [[p_{1}, …, p_{n}], χ^{2}, C], where C is the covariance matrix of the estimators p_{i} given by C_{i, i} = σ(p_{i})^{2} and C_{i, j} = cov(p_{i}, p_{j}) for i ≠ j. With this option, information on confidence intervals for the least squares estimators p_{i} are provided. In particular, the return value includes the covariance matrix C of type Dom::Matrix(). This matrix provides the variances C_{ii} = σ(p_{i})^{2} of the least squares estimators p_{i} and their covariances C_{ij} = cov(p_{i}, p_{j}). The covariance matrix is defined via its inverse , Where . The covariance matrix of the least squares estimators only has a statistical meaning if the stochastic variances σ(y_{i})^{2} of the measurements y_{i} are known. These variances are to be included in the computation by choosing the weights . Cf. Example 6. The function stats::correlationMatrix serves for converting the covariance matrix to the corresponding correlation matrix. See Example 6 |
Without the option CovarianceMatrix, a list [[p_{1}, …, p_{n}], χ^{2}] is returned. It contains the optimized fit parameters p_{i} minimizing the quadratic deviation. The minimized value of this deviation is given by χ^{2}, it indicates the quality of the fit.
With the option CovarianceMatrix, a list [[p_{1}, …, p_{n}], χ^{2}, C] is returned. The n×n matrix C is the covariance matrix of the fit parameters.
All returned data are floating-point values. FAIL is returned if a least square fit of the data is not possible with the given model function or if the internal numerical search failed.
stats::reg uses a Marquardt-Levenberg gradient expansion algorithm. Searching for the minimum of , the algorithm does not simply follow the negative gradient, but the diagonal terms of the curvature matrix are increased by a factor that is optimized in each step of the search.