To begin fitting a regression, put your data into a form that
fitting functions expect. All regression techniques begin with input
data in an array X
and response data in a separate
vector y
, or input data in a table or dataset array tbl
and
response data as a column in tbl
. Each row of the
input data represents one observation. Each column represents one
predictor (variable).
For a table or dataset array tbl
, indicate
the response variable with the 'ResponseVar'
namevalue
pair:
mdl = fitlm(tbl,'ResponseVar','BloodPressure'); % or mdl = fitglm(tbl,'ResponseVar','BloodPressure');
The response variable is the last column by default.
You can use numeric categorical predictors. A categorical predictor is one that takes values from a fixed set of possibilities.
For a numeric array X
, indicate
the categorical predictors using the 'Categorical'
namevalue
pair. For example, to indicate that predictors 2
and 3
out
of six are categorical:
mdl = fitlm(X,y,'Categorical',[2,3]); % or mdl = fitglm(X,y,'Categorical',[2,3]); % or equivalently mdl = fitlm(X,y,'Categorical',logical([0 1 1 0 0 0]));
For a table or dataset array tbl
,
fitting functions assume that these data types are categorical:
Logical
Categorical (nominal or ordinal)
String or character array
If you want to indicate that a numeric predictor is categorical,
use the 'Categorical'
namevalue pair.
Represent missing numeric data as NaN
. To
represent missing data for other data types, see Missing Group Values.
To create a dataset array from an Excel^{®} spreadsheet:
ds = dataset('XLSFile','hospital.xls',... 'ReadObsNames',true);
To create a dataset array from workspace variables:
load carsmall
ds = dataset(MPG,Weight);
ds.Year = ordinal(Model_Year);
To create a table from an Excel spreadsheet:
tbl = readtable('hospital.xls',... 'ReadRowNames',true);
To create a table from workspace variables:
load carsmall
tbl = table(MPG,Weight);
tbl.Year = ordinal(Model_Year);
For example, to create numeric arrays from workspace variables:
load carsmall
X = [Weight Horsepower Cylinders Model_Year];
y = MPG;
To create numeric arrays from an Excel spreadsheet:
[X Xnames] = xlsread('hospital.xls'); y = X(:,4); % response y is systolic pressure X(:,4) = []; % remove y from the X matrix
Notice that the nonnumeric entries, such as sex
,
do not appear in X
.
There are three ways to fit a model to data:
Use fitlm
to construct
a leastsquares fit of a model to the data. This method is best when
you are reasonably certain of the model's form, and mainly
need to find its parameters. This method is also useful when you want
to explore a few models. The method requires you to examine the data
manually to discard outliers, though there are techniques to help
(see Residuals — Model Quality for Training Data).
Use fitlm
with the RobustOpts
namevalue
pair to create a model that is little affected by outliers. Robust
fitting saves you the trouble of manually discarding outliers. However, step
does not
work with robust fitting. This means that when you use robust fitting,
you cannot search stepwise for a good model.
Use stepwiselm
to find
a model, and fit parameters to the model. stepwiselm
starts
from one model, such as a constant, and adds or subtracts terms one
at a time, choosing an optimal term each time in a greedy fashion,
until it cannot improve further. Use stepwise fitting to find a good
model, which is one that has only relevant terms.
The result depends on the starting model. Usually, starting with a constant model leads to a small model. Starting with more terms can lead to a more complex model, but one that has lower mean squared error. See Compare large and small stepwise models.
You cannot use robust options along with stepwise fitting. So after a stepwise fit, examine your model for outliers (see Residuals — Model Quality for Training Data).
There are several ways of specifying a model for linear regression. Use whichever you find most convenient.
For fitlm
, the model specification
you give is the model that is fit. If you do not give a model specification,
the default is 'linear'
.
For stepwiselm
, the model
specification you give is the starting model, which the stepwise procedure
tries to improve. If you do not give a model specification, the default
starting model is 'constant'
, and the default upper
bounding model is 'interactions'
. Change the upper
bounding model using the Upper
namevalue pair.
Note:
There are other ways of selecting models, such as using 
String  Model Type 

'constant'  Model contains only a constant (intercept) term. 
'linear'  Model contains an intercept and linear terms for each predictor. 
'interactions'  Model contains an intercept, linear terms, and all products of pairs of distinct predictors (no squared terms). 
'purequadratic'  Model contains an intercept, linear terms, and squared terms. 
'quadratic'  Model contains an intercept, linear terms, interactions, and squared terms. 
'poly  Model is a polynomial with all terms up to degree i in
the first predictor, degree j in the second
predictor, etc. Use numerals 0 through 9 .
For example, 'poly2111' has a constant plus all
linear and product terms, and also contains terms with predictor 1
squared. 
For example, to specify an interaction model using fitlm
with matrix predictors:
mdl = fitlm(X,y,'interactions');
To specify a model using stepwiselm
and
a table or dataset array tbl
of predictors, suppose
you want to start from a constant and have a linear model upper bound.
Assume the response variable in tbl
is in the third
column.
mdl2 = stepwiselm(tbl,'constant',... 'Upper','linear','ResponseVar',3);
A terms matrix is a tby(p + 1) matrix specifying terms in a model, where t is the number of terms, p is the number of predictor variables, and plus one is for the response variable.
The value of T(i,j)
is the exponent of variable j
in
term i
. Suppose there are three predictor variables A
, B
,
and C
:
[0 0 0 0] % Constant term or intercept [0 1 0 0] % B; equivalently, A^0 * B^1 * C^0 [1 0 1 0] % A*C [2 0 0 0] % A^2 [0 1 2 0] % B*(C^2)
0
at the
end of each term represents the response variable. In general,If you have the variables in a table or dataset array,
then 0
must represent the response variable depending
on the position of the response variable. The following example illustrates
this.
Load the sample data and define the dataset array.
load hospital ds = dataset(hospital.Sex,hospital.BloodPressure(:,1),hospital.Age,... hospital.Smoker,'VarNames',{'Sex','BloodPressure','Age','Smoker'});
Represent the linear model 'BloodPressure ~ 1 + Sex
+ Age + Smoker'
in a terms matrix. The response variable
is in the second column of the dataset array, so there must be a column
of 0s for the response variable in the second column of the terms
matrix.
T = [0 0 0 0;1 0 0 0;0 0 1 0;0 0 0 1]
T = 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1
Redefine the dataset array.
ds = dataset(hospital.BloodPressure(:,1),hospital.Sex,hospital.Age,... hospital.Smoker,'VarNames',{'BloodPressure','Sex','Age','Smoker'});
Now, the response variable is the first term in the dataset
array. Specify the same linear model, 'BloodPressure ~ 1
+ Sex + Age + Smoker'
, using a terms matrix.
T = [0 0 0 0;0 1 0 0;0 0 1 0;0 0 0 1]
T = 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
If you have the predictor and response variables in
a matrix and column vector, then you must include 0
for
the response variable at the end of each term. The following example
illustrates this.
Load the sample data and define the matrix of predictors.
load carsmall
X = [Acceleration,Weight];
Specify the model 'MPG ~ Acceleration + Weight + Acceleration:Weight
+ Weight^2'
using a term matrix and fit the model to the
data. This model includes the main effect and twoway interaction
terms for the variables, Acceleration
and Weight
,
and a secondorder term for the variable, Weight
.
T = [0 0 0;1 0 0;0 1 0;1 1 0;0 2 0]
T = 0 0 0 1 0 0 0 1 0 1 1 0 0 2 0
Fit a linear model.
mdl = fitlm(X,MPG,T)
mdl = Linear regression model: y ~ 1 + x1*x2 + x2^2 Estimated Coefficients: Estimate SE tStat pValue (Intercept) 48.906 12.589 3.8847 0.00019665 x1 0.54418 0.57125 0.95261 0.34337 x2 0.012781 0.0060312 2.1192 0.036857 x1:x2 0.00010892 0.00017925 0.6076 0.545 x2^2 9.7518e07 7.5389e07 1.2935 0.19917 Number of observations: 94, Error degrees of freedom: 89 Root Mean Squared Error: 4.1 Rsquared: 0.751, Adjusted RSquared 0.739 Fstatistic vs. constant model: 67, pvalue = 4.99e26
Only the intercept and x2
term, which correspond
to the Weight
variable, are significant at the
5% significance level.
Now, perform a stepwise regression with a constant model as the starting model and a linear model with interactions as the upper model.
T = [0 0 0;1 0 0;0 1 0;1 1 0];
mdl = stepwiselm(X,MPG,[0 0 0],'upper',T)
1. Adding x2, FStat = 259.3087, pValue = 1.643351e28 mdl = Linear regression model: y ~ 1 + x2 Estimated Coefficients: Estimate SE tStat pValue (Intercept) 49.238 1.6411 30.002 2.7015e49 x2 0.0086119 0.0005348 16.103 1.6434e28 Number of observations: 94, Error degrees of freedom: 92 Root Mean Squared Error: 4.13 Rsquared: 0.738, Adjusted RSquared 0.735 Fstatistic vs. constant model: 259, pvalue = 1.64e28
The results of the stepwise regression are consistent with the
results of fitlm
in the previous step.
A formula for a model specification is a string of the form
'
,Y
~ terms
'
Y
is the response name.
terms
contains
Variable names
+
to include the next variable

to exclude the next variable
:
to define an interaction, a product
of terms
*
to define an interaction and
all lowerorder terms
^
to raise the predictor to a power,
exactly as in *
repeated, so ^
includes
lower order terms as well
()
to group terms
Tip
Formulas include a constant (intercept) term by default. To
exclude a constant term from the model, include 
Examples:
'Y ~ A + B + C'
is a threevariable
linear model with intercept.'Y ~ A + B +
C  1'
is a threevariable linear model without intercept.'Y ~ A + B + C + B^2'
is a threevariable
model with intercept and a B^2
term.'Y
~ A + B^2 + C'
is the same as the previous example, since B^2
includes
a B
term.'Y ~ A + B +
C + A:B'
includes an A*B
term.'Y
~ A*B + C'
is the same as the previous example, since A*B
= A + B + A:B
.'Y ~ A*B*C  A:B:C'
has
all interactions among A
, B
,
and C
, except the threeway interaction.'Y
~ A*(B + C + D)'
has all linear terms, plus products of A
with
each of the other variables.
For example, to specify an interaction model using fitlm
with matrix predictors:
mdl = fitlm(X,y,'y ~ x1*x2*x3  x1:x2:x3');
To specify a model using stepwiselm
and
a table or dataset array tbl
of predictors, suppose
you want to start from a constant and have a linear model upper bound.
Assume the response variable in tbl
is named 'y'
,
and the predictor variables are named 'x1'
, 'x2'
,
and 'x3'
.
mdl2 = stepwiselm(tbl,'y ~ 1','Upper','y ~ x1 + x2 + x3');
The most common optional arguments for fitting:
For robust regression in fitlm
,
set the 'RobustOpts'
namevalue pair to 'on'
.
Specify an appropriate upper bound model in stepwiselm
, such as set 'Upper'
to 'linear'
.
Indicate which variables are categorical using the 'CategoricalVars'
namevalue
pair. Provide a vector with column numbers, such as [1 6]
to
specify that predictors 1
and 6
are
categorical. Alternatively, give a logical vector the same length
as the data columns, with a 1
entry indicating
that variable is categorical. If there are seven predictors, and predictors 1
and 6
are
categorical, specify logical([1,0,0,0,0,1,0])
.
For a table or dataset array, specify the response
variable using the 'ResponseVar'
namevalue pair.
The default is the last column in the array.
For example,
mdl = fitlm(X,y,'linear',... 'RobustOpts','on','CategoricalVars',3); mdl2 = stepwiselm(tbl,'constant',... 'ResponseVar','MPG','Upper','quadratic');
After fitting a model, examine the result.
A linear regression model shows several diagnostics when you
enter its name or enter disp(mdl)
. This display
gives some of the basic information to check whether the fitted model
represents the data adequately.
For example, fit a linear model to data constructed with two out of five predictors not present and with no intercept term:
X = randn(100,5); y = X*[1;0;3;0;1]+randn(100,1); mdl = fitlm(X,y)
mdl = Linear regression model: y ~ 1 + x1 + x2 + x3 + x4 + x5 Estimated Coefficients: Estimate SE tStat pValue (Intercept) 0.038164 0.099458 0.38372 0.70205 x1 0.92794 0.087307 10.628 8.5494e18 x2 0.075593 0.10044 0.75264 0.45355 x3 2.8965 0.099879 29 1.1117e48 x4 0.045311 0.10832 0.41831 0.67667 x5 0.99708 0.11799 8.4504 3.593e13 Number of observations: 100, Error degrees of freedom: 94 Root Mean Squared Error: 0.972 Rsquared: 0.93, Adjusted RSquared 0.926 Fstatistic vs. constant model: 248, pvalue = 1.5e52
Notice that:
The display contains the estimated values of each
coefficient in the Estimate
column. These values
are reasonably near the true values [0;1;0;3;0;1]
.
There is a standard error column for the coefficient estimates.
The reported pValue
(which are
derived from the t statistics under the assumption
of normal errors) for predictors 1, 3, and 5 are extremely small.
These are the three predictors that were used to create the response
data y
.
The pValue
for (Intercept)
, x2
and x4
are
much larger than 0.01. These three predictors were not used to create
the response data y
.
The display contains R^{2}, adjusted R^{2}, and F statistics.
To examine the quality of the fitted model, consult an ANOVA
table. For example, use anova
on a linear model with five predictors:
X = randn(100,5); y = X*[1;0;3;0;1]+randn(100,1); mdl = fitlm(X,y); tbl = anova(mdl)
tbl = SumSq DF MeanSq F pValue x1 106.62 1 106.62 112.96 8.5494e18 x2 0.53464 1 0.53464 0.56646 0.45355 x3 793.74 1 793.74 840.98 1.1117e48 x4 0.16515 1 0.16515 0.17498 0.67667 x5 67.398 1 67.398 71.41 3.593e13 Error 88.719 94 0.94382
This table gives somewhat different results than the default
display (see Model Display). The
table clearly shows that the effects of x2
and x4
are
not significant. Depending on your goals, consider removing x2
and x4
from
the model.
Diagnostic plots help you identify outliers, and see other problems
in your model or fit. For example, load the carsmall
data,
and make a model of MPG
as a function of Cylinders
(nominal)
and Weight
:
load carsmall tbl = table(Weight,MPG,Cylinders); tbl.Cylinders = ordinal(tbl.Cylinders); mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');ˋ
Make a leverage plot of the data and model.
plotDiagnostics(mdl)
There are a few points with high leverage. But this plot does not reveal whether the highleverage points are outliers.
Look for points with large Cook's distance.
plotDiagnostics(mdl,'cookd')
There is one point with large Cook's distance. Identify it and remove it from the model. You can use the Data Cursor to click the outlier and identify it, or identify it programmatically:
[~,larg] = max(mdl.Diagnostics.CooksDistance); mdl2 = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2',... 'Exclude',larg);
There are several residual plots to help you discover errors, outliers, or correlations in the model or data. The simplest residual plots are the default histogram plot, which shows the range of the residuals and their frequencies, and the probability plot, which shows how the distribution of the residuals compares to a normal distribution with matched variance.
Load the carsmall
data, and make a model
of MPG
as a function of Cylinders
(nominal)
and Weight
:
load carsmall tbl = table(Weight,MPG,Cylinders); tbl.Cylinders = ordinal(tbl.Cylinders); mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');
Examine the residuals:
plotResiduals(mdl)
The observations above 12 are potential outliers.
plotResiduals(mdl,'probability')
The two potential outliers appear on this plot as well. Otherwise, the probability plot seems reasonably straight, meaning a reasonable fit to normally distributed residuals.
You can identify the two outliers and remove them from the data:
outl = find(mdl.Residuals.Raw > 12)
outl = 90 97
To remove the outliers, use the Exclude
namevalue
pair:
mdl2 = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2',... 'Exclude',outl);
Examine a residuals plot of mdl2
:
plotResiduals(mdl2)
The new residuals plot looks fairly symmetric, without obvious problems. However, there might be some serial correlation among the residuals. Create a new plot to see if such an effect exists.
plotResiduals(mdl2,'lagged')
The scatter plot shows many more crosses in the upperright and lowerleft quadrants than in the other two quadrants, indicating positive serial correlation among the residuals.
Another potential issue is when residuals are large for large observations. See if the current model has this issue.
plotResiduals(mdl2,'fitted')
There is some tendency for larger fitted values to have larger residuals. Perhaps the model errors are proportional to the measured values.
This example shows how to understand the effect each predictor has on a regression model using a variety of available plots.
Create a model of mileage from some predictors in
the carsmall
data.
load carsmall tbl = table(Weight,MPG,Cylinders); tbl.Cylinders = ordinal(tbl.Cylinders); mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');
Examine a slice plot of the responses. This displays the effect of each predictor separately.
plotSlice(mdl)
You can drag the individual predictor values, which are represented by dashed blue vertical lines. You can also choose between simultaneous and nonsimultaneous confidence bounds, which are represented by dashed red curves.
Use an effects plot to show another view of the effect of predictors on the response.
plotEffects(mdl)
This plot shows that changing Weight
from
about 2500 to 4732 lowers MPG
by about 30 (the
location of the upper blue circle). It also shows that changing the
number of cylinders from 8 to 4 raises MPG
by about
10 (the lower blue circle). The horizontal blue lines represent confidence
intervals for these predictions. The predictions come from averaging
over one predictor as the other is changed. In cases such as this,
where the two predictors are correlated, be careful when interpreting
the results.
Instead of viewing the effect of averaging over a predictor as the other is changed, examine the joint interaction in an interaction plot.
plotInteraction(mdl,'Weight','Cylinders')
The interaction plot shows the effect of changing one predictor
with the other held fixed. In this case, the plot is much more informative.
It shows, for example, that lowering the number of cylinders in a
relatively light car (Weight
= 1795) leads to an
increase in mileage, but lowering the number of cylinders in a relatively
heavy car (Weight
= 4732) leads to a decrease in
mileage.
For an even more detailed look at the interactions, look at an interaction plot with predictions. This plot holds one predictor fixed while varying the other, and plots the effect as a curve. Look at the interactions for various fixed numbers of cylinders.
plotInteraction(mdl,'Cylinders','Weight','predictions')
Now look at the interactions with various fixed levels of weight.
plotInteraction(mdl,'Weight','Cylinders','predictions')
This example shows how to understand the effect of each term in a regression model using a variety of available plots.
Create a model of mileage from some predictors in
the carsmall
data.
load carsmall tbl = table(Weight,MPG,Cylinders); tbl.Cylinders = ordinal(tbl.Cylinders); mdl = fitlm(tbl,'MPG ~ Cylinders*Weight + Weight^2');
Create an added variable plot with Weight^2
as
the added variable.
plotAdded(mdl,'Weight^2')
This plot shows the results of fitting both Weight^2
and MPG
to
the terms other than Weight^2
. The reason to use plotAdded
is
to understand what additional improvement in the model you get by
adding Weight^2
. The coefficient of a line fit
to these points is the coefficient of Weight^2
in
the full model. The Weight^2
predictor is just
over the edge of significance (pValue
< 0.05)
as you can see in the coefficients table display. You can see that
in the plot as well. The confidence bounds look like they could not
contain a horizontal line (constant y), so a zeroslope model is not
consistent with the data.
Create an added variable plot for the model as a whole.
plotAdded(mdl)
The model as a whole is very significant, so the bounds don't come close to containing a horizontal line. The slope of the line is the slope of a fit to the predictors projected onto their bestfitting direction, or in other words, the norm of the coefficient vector.
There are two ways to change a model:
step
—
Add or subtract terms one at a time, where step
chooses
the most important term to add or remove.
addTerms
and removeTerms
—
Add or remove specified terms. Give the terms in any of the forms
described in Choose a Model or Range of Models.
If you created a model using stepwiselm
, step
can
have an effect only if you give different upper or lower models. step
does
not work when you fit a model using RobustOpts
.
For example, start with a linear model of mileage from the carbig
data:
load carbig tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG); mdl = fitlm(tbl,'linear','ResponseVar','MPG')
mdl = Linear regression model: MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight Estimated Coefficients: Estimate SE tStat pValue (Intercept) 45.251 2.456 18.424 7.0721e55 Acceleration 0.023148 0.1256 0.1843 0.85388 Displacement 0.0060009 0.0067093 0.89441 0.37166 Horsepower 0.043608 0.016573 2.6312 0.008849 Weight 0.0052805 0.00081085 6.5123 2.3025e10 Number of observations: 392, Error degrees of freedom: 387 Root Mean Squared Error: 4.25 Rsquared: 0.707, Adjusted RSquared 0.704 Fstatistic vs. constant model: 233, pvalue = 9.63e102
Try to improve the model using step
for up
to 10 steps:
mdl1 = step(mdl,'NSteps',10)
1. Adding Displacement:Horsepower, FStat = 87.4802, pValue = 7.05273e19 mdl1 = Linear regression model: MPG ~ 1 + Acceleration + Weight + Displacement*Horsepower Estimated Coefficients: Estimate SE tStat pValue (Intercept) 61.285 2.8052 21.847 1.8593e69 Acceleration 0.34401 0.11862 2.9 0.0039445 Displacement 0.081198 0.010071 8.0623 9.5014e15 Horsepower 0.24313 0.026068 9.3265 8.6556e19 Weight 0.0014367 0.00084041 1.7095 0.088166 Displacement:Horsepower 0.00054236 5.7987e05 9.3531 7.0527e19 Number of observations: 392, Error degrees of freedom: 386 Root Mean Squared Error: 3.84 Rsquared: 0.761, Adjusted RSquared 0.758 Fstatistic vs. constant model: 246, pvalue = 1.32e117
step
stopped after just one change.
To try to simplify the model, remove the Acceleration
and Weight
terms
from mdl1
:
mdl2 = removeTerms(mdl1,'Acceleration + Weight')
mdl2 = Linear regression model: MPG ~ 1 + Displacement*Horsepower Estimated Coefficients: Estimate SE tStat pValue (Intercept) 53.051 1.526 34.765 3.0201e121 Displacement 0.098046 0.0066817 14.674 4.3203e39 Horsepower 0.23434 0.019593 11.96 2.8024e28 Displacement:Horsepower 0.00058278 5.193e05 11.222 1.6816e25 Number of observations: 392, Error degrees of freedom: 388 Root Mean Squared Error: 3.94 Rsquared: 0.747, Adjusted RSquared 0.745 Fstatistic vs. constant model: 381, pvalue = 3e115
mdl2
uses just Displacement
and Horsepower
,
and has nearly as good a fit to the data as mdl1
in
the Adjusted RSquared
metric.
There are three ways to use a linear model to predict or simulate the response to new data:
This example shows how to predict and obtain confidence intervals
on the predictions using the predict
method.
Load the carbig
data and make a
default linear model of the response MPG
to the Acceleration
, Displacement
, Horsepower
,
and Weight
predictors.
load carbig
X = [Acceleration,Displacement,Horsepower,Weight];
mdl = fitlm(X,MPG);
Create a threerow array of predictors from the minimal,
mean, and maximal values. There are some NaN
values,
so use functions that ignore NaN
values.
Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data
Find the predicted model responses and confidence intervals on the predictions.
[NewMPG NewMPGCI] = predict(mdl,Xnew)
NewMPG = 34.1345 23.4078 4.7751 NewMPGCI = 31.6115 36.6575 22.9859 23.8298 0.6134 8.9367
The confidence bound on the mean response is narrower than those for the minimum or maximum responses, which is quite sensible.
When you construct a model from a table or dataset array, feval
is
often more convenient for predicting mean responses than predict
.
However, feval
does not provide confidence bounds.
This example shows how to predict mean responses using the feval
method.
Load the carbig
data and make a
default linear model of the response MPG
to the Acceleration
, Displacement
, Horsepower
,
and Weight
predictors.
load carbig tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG); mdl = fitlm(tbl,'linear','ResponseVar','MPG');
Create a threerow array of predictors from the minimal,
mean, and maximal values. There are some NaN
values,
so use functions that ignore NaN
values.
X = [Acceleration,Displacement,Horsepower,Weight];
Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data
The Xnew
array has the correct number of
columns for prediction, so feval
can use it for
predictions.
Find the predicted model responses.
NewMPG = feval(mdl,Xnew)
NewMPG = 34.1345 23.4078 4.7751
The random
method
simulates new random response values, equal to the mean prediction
plus a random disturbance with the same variance as the training data.
This example shows how to simulate responses using the random
method.
Load the carbig
data and make a
default linear model of the response MPG
to the Acceleration
, Displacement
, Horsepower
,
and Weight
predictors.
load carbig
X = [Acceleration,Displacement,Horsepower,Weight];
mdl = fitlm(X,MPG);
Create a threerow array of predictors from the minimal,
mean, and maximal values. There are some NaN
values,
so use functions that ignore NaN
values.
Xnew = [nanmin(X);nanmean(X);nanmax(X)]; % new data
Generate new predicted model responses including some randomness.
rng('default') % for reproducibility NewMPG = random(mdl,Xnew)
NewMPG = 36.4178 31.1958 4.8176
Because a negative value of MPG
does
not seem sensible, try predicting two more times.
NewMPG = random(mdl,Xnew)
NewMPG = 37.7959 24.7615 0.7783
NewMPG = random(mdl,Xnew)
NewMPG = 32.2931 24.8628 19.9715
Clearly, the predictions for the third (maximal) row of Xnew
are
not reliable.
Suppose you have a linear regression model, such as mdl
from
the following commands:
load carbig tbl = table(Acceleration,Displacement,Horsepower,Weight,MPG); mdl = fitlm(tbl,'linear','ResponseVar','MPG');
To share the model with other people, you can:
Provide the model display.
mdl
mdl = Linear regression model: MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight Estimated Coefficients: Estimate SE tStat pValue (Intercept) 45.251 2.456 18.424 7.0721e55 Acceleration 0.023148 0.1256 0.1843 0.85388 Displacement 0.0060009 0.0067093 0.89441 0.37166 Horsepower 0.043608 0.016573 2.6312 0.008849 Weight 0.0052805 0.00081085 6.5123 2.3025e10 Number of observations: 392, Error degrees of freedom: 387 Root Mean Squared Error: 4.25 Rsquared: 0.707, Adjusted RSquared 0.704 Fstatistic vs. constant model: 233, pvalue = 9.63e102
Provide just the model definition and coefficients.
mdl.CoefficientNames
ans = '(Intercept)' 'Acceleration' 'Displacement' 'Horsepower' 'Weight'
mdl.Coefficients.Estimate
ans = 45.2511 0.0231 0.0060 0.0436 0.0053
mdl.Formula
ans = MPG ~ 1 + Acceleration + Displacement + Horsepower + Weight
This example shows how to fit a linear regression model. A typical workflow involves the following: import data, fit a regression, test its quality, modify it to improve the quality, and share it.
Step 1. Import the data into a dataset array.
hospital.xls
is an Excel® spreadsheet containing patient names, sex, age, weight, blood pressure, and dates of treatment in an experimental protocol. First read the data into a table.
patients = readtable('hospital.xls',... 'ReadRowNames',true);
Examine the first row of data.
patients(1,:)
ans = name sex age wgt smoke sys dia trial1 trial2 trial3 trial4 _______ ___ ___ ___ _____ ___ ___ ______ ______ ______ ______ YPL320 'SMITH' 'm' 38 176 1 124 93 18 99 99 99
The sex
and smoke
fields seem to have two choices each. So change these fields to nominal.
patients.smoke = nominal(patients.smoke,{'No','Yes'}); patients.sex = nominal(patients.sex);
Step 2. Create a fitted model.
Your goal is to model the systolic pressure as a function of a patient's age, weight, sex, and smoking status. Create a linear formula for 'sys'
as a function of 'age'
, 'wgt'
, 'sex'
, and 'smoke'
.
modelspec = 'sys ~ age + wgt + sex + smoke';
mdl = fitlm(patients,modelspec)
mdl = Linear regression model: sys ~ 1 + sex + age + wgt + smoke Estimated Coefficients: Estimate SE tStat pValue _________ ________ ________ __________ (Intercept) 118.28 7.6291 15.504 9.1557e28 sex_m 0.88162 2.9473 0.29913 0.76549 age 0.08602 0.06731 1.278 0.20438 wgt 0.016685 0.055714 0.29947 0.76524 smoke_Yes 9.884 1.0406 9.498 1.9546e15 Number of observations: 100, Error degrees of freedom: 95 Root Mean Squared Error: 4.81 Rsquared: 0.508, Adjusted RSquared 0.487 Fstatistic vs. constant model: 24.5, pvalue = 5.99e14
The sex, age, and weight predictors have rather high values, indicating that some of these predictors might be unnecessary.
Step 3. Locate and remove outliers.
See if there are outliers in the data that should be excluded from the fit. Plot the residuals.
plotResiduals(mdl)
There is one possible outlier, with a value greater than 12. This is probably not truly an outlier. For demonstration, here is how to find and remove it.
Find the outlier.
outlier = mdl.Residuals.Raw > 12; find(outlier)
ans = 84
Remove the outlier.
mdl = fitlm(patients,modelspec,... 'Exclude',84); mdl.ObservationInfo(84,:)
ans = Weights Excluded Missing Subset _______ ________ _______ ______ WXM486 1 true false false
Observation 84 is no longer in the model.
Step 4. Simplify the model.
Try to obtain a simpler model, one with fewer predictors but the same predictive accuracy. step
looks for a better model by adding or removing one term at a time. Allow step
take up to 10 steps.
mdl1 = step(mdl,'NSteps',10)
1. Removing wgt, FStat = 4.6001e05, pValue = 0.9946 2. Removing sex, FStat = 0.063241, pValue = 0.80199 mdl1 = Linear regression model: sys ~ 1 + age + smoke Estimated Coefficients: Estimate SE tStat pValue ________ ________ ______ __________ (Intercept) 115.11 2.5364 45.383 1.1407e66 age 0.10782 0.064844 1.6628 0.09962 smoke_Yes 10.054 0.97696 10.291 3.5276e17 Number of observations: 99, Error degrees of freedom: 96 Root Mean Squared Error: 4.61 Rsquared: 0.536, Adjusted RSquared 0.526 Fstatistic vs. constant model: 55.4, pvalue = 1.02e16
step
took two steps. This means it could not improve the model further by adding or subtracting a single term.
Plot the effectiveness of the simpler model on the training data.
plotResiduals(mdl1)
The residuals look about as small as those of the original model.
Step 5. Predict responses to new data.
Suppose you have four new people, aged 25, 30, 40, and 65, and the first and third smoke. Predict their systolic pressure using mdl1
.
ages = [25;30;40;65]; smoker = {'Yes';'No';'Yes';'No'}; systolicnew = feval(mdl1,ages,smoker)
systolicnew = 127.8561 118.3412 129.4734 122.1149
To make predictions, you need only the variables that mdl1
uses.
Step 6. Share the model.
You might want others to be able to use your model for prediction. Access the terms in the linear model.
coefnames = mdl1.CoefficientNames
coefnames = '(Intercept)' 'age' 'smoke_Yes'
View the model formula.
mdl1.Formula
ans = sys ~ 1 + age + smoke
Access the coefficients of the terms.
coefvals = mdl1.Coefficients(:,1); % table
coefvals = table2array(coefvals)
coefvals = 115.1066 0.1078 10.0540
The model is sys = 115.1066 + 0.1078*age + 10.0540*smoke
, where smoke
is 1
for a smoker, and 0
otherwise.