Stepwise regression
b = stepwisefit(X,y)
[b,se,pval,inmodel,stats,nextstep,history] = stepwisefit(...)
[...] = stepwisefit(X,y,param1,val1,param2,val2,...)
b = stepwisefit(X,y) uses a stepwise method to perform a multilinear regression of the response values in the n-by-1 vector y on the p predictive terms in the n-by-p matrix X. Distinct predictive terms should appear in different columns of X. b is a p-by-1 vector of estimated coefficients for all of the terms in X.
The stepwisefit function calculates the coefficient estimate values in b as follows:
If a term is not in the final model, then the corresponding coefficient estimate in b results from adding only that term to the predictors in the final model.
If a term is in the final model, then the coefficient estimate in b for that term is a result of the final model; that is, stepwisefit does not consider the terms it excluded from the model while computing these values.
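The rule for excluded terms can be checked numerically. The following is a minimal sketch, assuming the hald data set used later on this page and the regress function for the refit; the variable names j, Xj, and bj are illustrative only.

```matlab
% Sketch: compare b(j) for an excluded term with an explicit refit.
load hald                                  % provides ingredients and heat
[b,~,~,inmodel] = stepwisefit(ingredients,heat,'display','off');

j = find(~inmodel,1);                      % first term left out of the final model

% Ordinary least squares on: intercept, final-model terms, then term j alone.
Xj = [ones(size(heat)) ingredients(:,inmodel) ingredients(:,j)];
bj = regress(heat,Xj);

% The last refit coefficient belongs to term j and should match b(j).
fprintf('stepwisefit b(%d) = %.4f, refit coefficient = %.4f\n', j, b(j), bj(end));
```

Unlike stepwisefit, regress does not add the constant term itself, which is why the sketch adds a column of ones to Xj.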
stepwisefit automatically includes a constant term in all models. Do not enter a column of 1s directly into X.
stepwisefit treats NaN values in either X or y as missing values, and ignores them.
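A short sketch of the missing-value handling, assuming the hald data set used later on this page. The variable heatWithGap is a hypothetical copy of the response with one value deliberately set to NaN.

```matlab
% Sketch: a NaN in y causes the whole observation (row) to be ignored.
load hald
heatWithGap = heat;
heatWithGap(3) = NaN;                      % mark the third response as missing

[b,~,~,~,stats] = stepwisefit(ingredients,heatWithGap,'display','off');

droppedRows = find(stats.wasnan)           % rows flagged as containing NaN
rowsUsed = stats.dfe + stats.df0 + 1       % error df + regression df + constant
constantTerm = stats.intercept             % intercept that stepwisefit added itself
```

The row count is recovered from the degrees of freedom because the constant term uses one degree of freedom in addition to those of the regression terms.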
[b,se,pval,inmodel,stats,nextstep,history] = stepwisefit(...) returns the following additional information:
se - A vector of standard errors for b
pval - A vector of p-values for testing whether elements of b are 0
inmodel - A logical vector, with length equal to the number of columns in X, specifying which terms are in the final model
stats - A structure of additional statistics with the following fields. All statistics pertain to the final model except where noted.
    source - The character vector 'stepwisefit'
    dfe - Degrees of freedom for error
    df0 - Degrees of freedom for the regression
    SStotal - Total sum of squares of the response
    SSresid - Sum of squares of the residuals
    fstat - F-statistic for testing the final model vs. no model (mean only)
    pval - p-value of the F-statistic
    rmse - Root mean square error
    xr - Residuals for predictors not in the final model, after removing the part of them explained by predictors in the model
    yr - Residuals for the response using predictors in the final model
    B - Coefficients for terms in final model, with values for a term not in the model set to the value that would be obtained by adding that term to the model
    SE - Standard errors for coefficient estimates
    TSTAT - t statistics for coefficient estimates
    PVAL - p-values for coefficient estimates
    intercept - Estimated intercept
    wasnan - Indicates which rows in the data contained NaN values
nextstep - The recommended next step: either the index of the next term to move in or out of the model, or 0 if no further steps are recommended
history - Structure containing information on steps taken, with the following fields:
    B - Matrix of regression coefficients, where each column is one step, and each row is one coefficient.
    rmse - Root mean square errors for the model at each step.
    df0 - Degrees of freedom for the regression at each step.
    in - Logical array indicating which predictors are in the model at each step, where each row is one step, and each column is one predictor.
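The outputs above can be inspected as follows. This is a minimal sketch, assuming the hald data set used later on this page; it only reads fields that the list above documents.

```matlab
% Sketch: request every output and look at the step history.
load hald
[b,se,pval,inmodel,stats,nextstep,history] = ...
    stepwisefit(ingredients,heat, ...
                'penter',0.05,'premove',0.10,'display','off');

finalTerms = find(inmodel)     % indices of the terms in the final model
nextstep                       % 0 means no further step is recommended
history.in                     % one row per step, one column per predictor
history.rmse                   % root mean square error after each step

% rmse of the final model is the residual sum of squares per error degree
% of freedom, on the square-root scale.
rmseCheck = sqrt(stats.SSresid/stats.dfe)
```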
[...] = stepwisefit(X,y,param1,val1,param2,val2,...) specifies one or more of the name/value pairs described in the following table.
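Parameter  Value

'inmodel'  A logical vector specifying terms to include in the initial fit. The default is to specify no terms.
'penter'  The maximum p-value for a term to be added. The default is 0.05.
'premove'  The minimum p-value for a term to be removed. The default is the maximum of the value of 'penter' and 0.10.
'display'  'on' displays information about each step in the command window (the default); 'off' omits the display.
'maxiter'  The maximum number of steps in the regression. The default is Inf.
'keep'  A logical vector specifying terms to keep in their initial state. The default is to specify no terms.
'scale'  'on' centers and scales each column of X (computes z-scores) before fitting; 'off' does not scale the terms (the default).
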
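As an illustration of combining name/value pairs, the sketch below starts with the second term in the model and uses 'keep' to hold it there. It assumes the hald data set used later on this page; startIn and locked are hypothetical variable names.

```matlab
% Sketch: force term 2 into the model and prevent stepwisefit from moving it.
load hald
startIn = [false true false false];   % initial model contains term 2 only
locked  = [false true false false];   % term 2 stays in its initial state

[b,~,~,inmodel] = stepwisefit(ingredients,heat, ...
    'inmodel',startIn,'keep',locked, ...
    'penter',0.05,'premove',0.10,'display','off');

finalTerms = find(inmodel)            % term 2 is retained whatever its p-value
```

Without 'keep', a term supplied through 'inmodel' is only a starting point and can still be removed in a later step.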
Load the data in hald.mat, which contains observations of the heat of reaction of various cement mixtures:

    load hald
    whos
      Name             Size      Bytes  Class     Attributes

      Description      22x58      2552  char
      hald             13x5        520  double
      heat             13x1        104  double
      ingredients      13x4        416  double

The response (heat) depends on the quantities of the four predictors (the columns of ingredients).
Use stepwisefit to carry out the stepwise regression algorithm, beginning with no terms in the model and using entrance/exit tolerances of 0.05/0.10 on the p-values:

    stepwisefit(ingredients,heat,...
                'penter',0.05,'premove',0.10);
    Initial columns included: none
    Step 1, added column 4, p=0.000576232
    Step 2, added column 1, p=1.10528e-006
    Final columns included:  1 4
        'Coeff'      'Std.Err.'    'Status'    'P'
        [ 1.4400]    [  0.1384]    'In'        [1.1053e-006]
        [ 0.4161]    [  0.1856]    'Out'       [     0.0517]
        [-0.4100]    [  0.1992]    'Out'       [     0.0697]
        [-0.6140]    [  0.0486]    'In'        [1.8149e-007]

stepwisefit automatically includes an intercept term in the model, so you do not add it explicitly to ingredients as you would for regress. For terms not in the model, coefficient estimates and their standard errors are those that result from adding the corresponding term to the final model.
The inmodel parameter is used to specify terms in an initial model:

    initialModel = ...
        [false true false false]; % Force in 2nd term
    stepwisefit(ingredients,heat,...
                'inmodel',initialModel,...
                'penter',.05,'premove',0.10);
    Initial columns included: 2
    Step 1, added column 1, p=2.69221e-007
    Final columns included:  1 2
        'Coeff'      'Std.Err.'    'Status'    'P'
        [ 1.4683]    [  0.1213]    'In'        [2.6922e-007]
        [ 0.6623]    [  0.0459]    'In'        [5.0290e-008]
        [ 0.2500]    [  0.1847]    'Out'       [     0.2089]
        [-0.2365]    [  0.1733]    'Out'       [     0.2054]

The preceding two models, built from different initial models, use different subsets of the predictive terms. Terms 2 and 4, swapped in the two models, are highly correlated:

    term2 = ingredients(:,2);
    term4 = ingredients(:,4);
    R = corrcoef(term2,term4)
    R =
        1.0000   -0.9730
       -0.9730    1.0000

To compare the models, use the stats output of stepwisefit:

    [betahat1,se1,pval1,inmodel1,stats1] = ...
        stepwisefit(ingredients,heat,...
                    'penter',.05,'premove',0.10,...
                    'display','off');
    [betahat2,se2,pval2,inmodel2,stats2] = ...
        stepwisefit(ingredients,heat,...
                    'inmodel',initialModel,...
                    'penter',.05,'premove',0.10,...
                    'display','off');
    RMSE1 = stats1.rmse
    RMSE1 =
        2.7343
    RMSE2 = stats2.rmse
    RMSE2 =
        2.4063

The second model has a lower Root Mean Square Error (RMSE).
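Other fit summaries can be derived from the same stats fields. The sketch below recomputes both fits and forms a coefficient of determination from SSresid and SStotal. R-squared is not a field of stats; the helper rsq is a hypothetical name introduced here for illustration.

```matlab
% Sketch: compare the two models using quantities derived from stats.
load hald
initialModel = [false true false false];

[~,~,~,in1,stats1] = stepwisefit(ingredients,heat, ...
    'penter',0.05,'premove',0.10,'display','off');
[~,~,~,in2,stats2] = stepwisefit(ingredients,heat, ...
    'inmodel',initialModel, ...
    'penter',0.05,'premove',0.10,'display','off');

% Fraction of the total sum of squares explained by the final model.
rsq = @(s) 1 - s.SSresid/s.SStotal;

fprintf('Model 1: terms [%s], df0 = %d, rmse = %.4f, R^2 = %.4f\n', ...
    num2str(find(in1)), stats1.df0, stats1.rmse, rsq(stats1));
fprintf('Model 2: terms [%s], df0 = %d, rmse = %.4f, R^2 = %.4f\n', ...
    num2str(find(in2)), stats2.df0, stats2.rmse, rsq(stats2));
```

Both models here contain two terms, so they have the same degrees of freedom and the comparison of rmse and R-squared is like for like.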
Stepwise regression is a systematic method for adding and removing terms from a multilinear model based on their statistical significance in a regression. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models. At each step, the p-value of an F-statistic is computed to test models with and without a potential term.
If a term is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added to the model. If there is sufficient evidence to reject the null hypothesis, the term is added to the model. Conversely, if a term is currently in the model, the null hypothesis is that the term has a zero coefficient. If there is insufficient evidence to reject the null hypothesis, the term is removed from the model. The method proceeds as follows:
1. Fit the initial model.
2. If any terms not in the model have p-values less than an entrance tolerance (that is, if it is unlikely that they would have zero coefficient if added to the model), add the one with the smallest p-value and repeat this step; otherwise, go to step 3.
3. If any terms in the model have p-values greater than an exit tolerance (that is, if it is unlikely that the hypothesis of a zero coefficient can be rejected), remove the one with the largest p-value and go to step 2; otherwise, end.
Depending on the terms included in the initial model and the order in which terms are moved in and out, the method may build different models from the same set of potential terms. The method terminates when no single step improves the model. There is no guarantee, however, that a different initial model or a different sequence of steps will not lead to a better fit. In this sense, stepwise models are locally optimal, but may not be globally optimal.
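The three steps can be sketched directly in code. This is an illustration of the procedure as described above, not the implementation inside stepwisefit. It assumes the hald data set from the examples, tests each term with a partial F-test through fcdf, and uses hypothetical helper names (design, sse, dfe, pTerm).

```matlab
% Sketch of the add/remove loop, starting from an empty model.
load hald
X = ingredients;  y = heat;
[n,p] = size(X);
penter = 0.05;  premove = 0.10;       % entrance and exit tolerances

design = @(cols) [ones(n,1) X(:,cols)];                       % constant plus chosen terms
sse    = @(cols) norm(y - design(cols)*(design(cols)\y))^2;   % residual sum of squares
dfe    = @(cols) n - numel(cols) - 1;                         % error degrees of freedom
% p-value of the F-statistic comparing a model with the one-term-larger model
pTerm  = @(small,big) fcdf((sse(small) - sse(big))/(sse(big)/dfe(big)), ...
                           1, dfe(big), 'upper');

in = false(1,p);                      % step 1: initial model has no terms
while true
    cur = find(in);
    % Step 2: add the excluded term with the smallest p-value, if below penter.
    pAdd = ones(1,p);
    for j = find(~in)
        pAdd(j) = pTerm(cur,[cur j]);
    end
    [pMin,jAdd] = min(pAdd);
    if pMin < penter
        in(jAdd) = true;
        continue                      % repeat step 2
    end
    % Step 3: remove the included term with the largest p-value, if above premove.
    pDrop = zeros(1,p);
    for j = cur
        pDrop(j) = pTerm(setdiff(cur,j),cur);
    end
    [pMax,jDrop] = max(pDrop);
    if pMax > premove
        in(jDrop) = false;            % then go back to step 2
    else
        break                         % no step improves the model
    end
end
disp(find(in))
```

Changing the starting value of in, for example to [false true false false], is a simple way to see the dependence on the initial model described above.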