stepwisefit

Fit linear regression model using stepwise regression

Syntax

``b = stepwisefit(X,y)``
``b = stepwisefit(X,y,Name,Value)``
``[b,se,pval] = stepwisefit(___)``
``[b,se,pval,finalmodel,stats] = stepwisefit(___)``
``[b,se,pval,finalmodel,stats,nextstep,history] = stepwisefit(___)``

Description

example

````b = stepwisefit(X,y)` returns a vector `b` of coefficient estimates from stepwise regression of the response vector `y` on the predictor variables in matrix `X`. `stepwisefit` begins with an initial constant model and takes forward or backward steps to add or remove variables, until a stopping criterion is satisfied.```

example

````b = stepwisefit(X,y,Name,Value)` specifies additional options using one or more name-value pair arguments. For example, you can specify a nonconstant initial model, or a maximum number of steps that `stepwisefit` can take.```

example

````[b,se,pval] = stepwisefit(___)` returns the coefficient estimates `b`, standard errors `se`, and p-values `pval` using any of the input argument combinations in previous syntaxes.```

example

````[b,se,pval,finalmodel,stats] = stepwisefit(___)` also returns a specification of the variables in the final regression model `finalmodel`, and statistics `stats` about the final model.```

example

````[b,se,pval,finalmodel,stats,nextstep,history] = stepwisefit(___)` also returns the recommended next step `nextstep` and information `history` about all the steps taken.```

Examples

collapse all

Perform a basic stepwise regression and obtain the coefficient estimates.

Load the `hald` data set.

```load hald whos % Check variables loaded in workspace```
``` Name Size Bytes Class Attributes Description 22x58 2552 char hald 13x5 520 double heat 13x1 104 double ingredients 13x4 416 double ```

This data set contains observations of the heat evolved during cement hardening for various mixtures of four cement ingredients. The response variable is `heat`. The matrix `ingredients` contains four columns of predictors.

Run `stepwisefit` beginning with only a constant term in the model and using the default entry and exit tolerances of 0.05 and 0.10, respectively.

`b = stepwisefit(ingredients,heat)`
```Initial columns included: none Step 1, added column 4, p=0.000576232 Step 2, added column 1, p=1.10528e-06 Final columns included: 1 4 {'Coeff' } {'Std.Err.'} {'Status'} {'P' } {[ 1.4400]} {[ 0.1384]} {'In' } {[1.1053e-06]} {[ 0.4161]} {[ 0.1856]} {'Out' } {[ 0.0517]} {[-0.4100]} {[ 0.1992]} {'Out' } {[ 0.0697]} {[-0.6140]} {[ 0.0486]} {'In' } {[1.8149e-07]} ```
```b = 4×1 1.4400 0.4161 -0.4100 -0.6140 ```

The `stepwisefit` display shows that columns `1` and `4` are included in the final model. The output `b` includes estimates for all columns, even those that do not appear in the final model. `stepwisefit` computes the estimate for column `2` (or `3`) by fitting a model consisting of the final model plus column `2` (or `3`).

Load the `carsmall` data set, which contains various car measurements.

```load carsmall whos```
``` Name Size Bytes Class Attributes Acceleration 100x1 800 double Cylinders 100x1 800 double Displacement 100x1 800 double Horsepower 100x1 800 double MPG 100x1 800 double Mfg 100x13 2600 char Model 100x33 6600 char Model_Year 100x1 800 double Origin 100x7 1400 char Weight 100x1 800 double ```

Perform stepwise regression with four continuous variables and the response variable `MPG`.

```X = [Acceleration Cylinders Displacement Horsepower]; y = MPG; b4_default = stepwisefit(X,y) % Stepwise regression with default arguments```
```Initial columns included: none Step 1, added column 2, p=1.59001e-25 Step 2, added column 4, p=0.00364266 Step 3, added column 1, p=0.0161414 Final columns included: 1 2 4 {'Coeff' } {'Std.Err.'} {'Status'} {'P' } {[-0.4517]} {[ 0.1842]} {'In' } {[ 0.0161]} {[-2.6407]} {[ 0.4823]} {'In' } {[4.0003e-07]} {[ 0.0148]} {[ 0.0157]} {'Out' } {[ 0.3472]} {[-0.0772]} {[ 0.0204]} {'In' } {[2.6922e-04]} ```
```b4_default = 4×1 -0.4517 -2.6407 0.0148 -0.0772 ```

The term `Displacement` never enters the model. Determine if it is highly correlated with the other three terms by computing the term correlation matrix.

`corrcoef(X,'Rows','complete') % To exclude rows with missing values from calculation`
```ans = 4×4 1.0000 -0.6438 -0.6968 -0.6968 -0.6438 1.0000 0.9517 0.8622 -0.6968 0.9517 1.0000 0.9134 -0.6968 0.8622 0.9134 1.0000 ```

The third row of the correlation matrix corresponds to `Displacement`. This term is highly correlated with the other three terms, especially `Cylinders `(`0.95`) and `Horsepower` (`0.91`).

Redefine the input matrix `X` to include `Weight`. Specify an initial model containing the terms `Displacement` and `Horsepower` by using the '`InModel'` name-value pair argument.

```X = [Acceleration Cylinders Displacement Horsepower Weight]; inmodel = [false false true true false]; b5_inmodel = stepwisefit(X,y,'InModel',inmodel)```
```Initial columns included: 3 4 Step 1, added column 5, p=1.06457e-06 Step 2, added column 2, p=0.00410234 Final columns included: 2 3 4 5 {'Coeff' } {'Std.Err.'} {'Status'} {'P' } {[-0.0912]} {[ 0.2032]} {'Out' } {[ 0.6548]} {[-2.3223]} {[ 0.7879]} {'In' } {[ 0.0041]} {[ 0.0252]} {[ 0.0145]} {'In' } {[ 0.0862]} {[-0.0449]} {[ 0.0231]} {'In' } {[ 0.0555]} {[-0.0050]} {[ 0.0012]} {'In' } {[1.0851e-04]} ```
```b5_inmodel = 5×1 -0.0912 -2.3223 0.0252 -0.0449 -0.0050 ```

The final model consists of terms `2–5`. However, `Displacement` and `Horsepower` estimates have $p$-values greater than `0.05` in the final model. You can tune the stepwise algorithm to behave more conservatively by using the `'PRemove'` name-value pair argument. For example, setting `'PRemove'` to `0.05` (instead of the default `0.1`) results in a smaller final model with only two terms, each with a $p$-value less than `0.05`.

`b5_inmodel_premove = stepwisefit(X,y,'InModel',inmodel,'PRemove',0.05)`
```Initial columns included: 3 4 Step 1, added column 5, p=1.06457e-06 Step 2, added column 2, p=0.00410234 Step 3, removed column 3, p=0.0862131 Step 4, removed column 4, p=0.239239 Final columns included: 2 5 {'Coeff' } {'Std.Err.'} {'Status'} {'P' } {[-0.0115]} {[ 0.1656]} {'Out' } {[ 0.9449]} {[-1.6037]} {[ 0.5146]} {'In' } {[ 0.0025]} {[ 0.0101]} {[ 0.0124]} {'Out' } {[ 0.4186]} {[-0.0234]} {[ 0.0198]} {'Out' } {[ 0.2392]} {[-0.0055]} {[ 0.0011]} {'In' } {[3.9038e-06]} ```
```b5_inmodel_premove = 5×1 -0.0115 -1.6037 0.0101 -0.0234 -0.0055 ```

Center and scale each column (compute the z-scores) before fitting by using the `'Scale'` name-value pair argument. The scaling does not change the model selected, the signs of coefficient estimates, or their $p$-values. However, the scaling does scale the coefficient estimates.

`b5_inmodel_premove_scale = stepwisefit(X,y,'InModel',inmodel,'PRemove',0.05,'Scale','on')`
```Initial columns included: 3 4 Step 1, added column 5, p=1.06457e-06 Step 2, added column 2, p=0.00410234 Step 3, removed column 3, p=0.0862131 Step 4, removed column 4, p=0.239239 Final columns included: 2 5 {'Coeff' } {'Std.Err.'} {'Status'} {'P' } {[-0.0370]} {[ 0.5339]} {'Out' } {[ 0.9449]} {[-2.8136]} {[ 0.9028]} {'In' } {[ 0.0025]} {[ 1.1155]} {[ 1.3726]} {'Out' } {[ 0.4186]} {[-1.0617]} {[ 0.8961]} {'Out' } {[ 0.2392]} {[-4.4406]} {[ 0.9028]} {'In' } {[3.9038e-06]} ```
```b5_inmodel_premove_scale = 5×1 -0.0370 -2.8136 1.1155 -1.0617 -4.4406 ```

Usually, you scale to compare estimates of terms that are measured in different scales, such as `Horsepower` and `Weight`. In this case, increasing `Horsepower` by one standard deviation leads to an expected drop of `1` in `MPG`, whereas increasing `Weight` by one standard deviation leads to an expected drop of `4.4` in `MPG`.

Load the `imports-85` data set. This data set contains characteristics of cars imported in 1985. For a list of all column names, see the variable `Description` in the workspace or type `Description` at the command line.

```load imports-85 whos ```
``` Name Size Bytes Class Attributes Description 9x79 1422 char X 205x26 42640 double ```

Choose a subset of continuous variables to use in stepwise regression, consisting of the predictor variables `engine-size`, `bore`, `stroke`, `compression-ratio`, `horsepower`, `peak-rpm`, `city-mpg`, and `highway-mpg`, and the response variable `price`.

```varnames = ["engine-size","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]; % Variable names to use in stepwise regression dataTbl = array2table(X(:,8:16),'VariableNames',varnames); % Create data table with variable names Xstepw = dataTbl{:,{'engine-size','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg'}}; % Input matrix ystepw = dataTbl{:,{'price'}}; % Response vector```

Run `stepwisefit` of the variable `price` on the other eight variables, first with the default constant initial model and then with an initial model including `highway-mpg`. Omit the display of step information.

```[betahat_def,se_def,pval_def,finalmodel_def,stats_def] = stepwisefit(Xstepw,ystepw,'Display',"off"); inmodel = [false false false false false false false true]; [betahat_in,se_in,pval_in,finalmodel_in,stats_in] = stepwisefit(Xstepw,ystepw,'InModel',inmodel,'Display','off');```

Inspect the final models returned by `stepwisefit`.

`finalmodel_def`
```finalmodel_def = 1x8 logical array 1 0 1 1 0 1 1 0 ```
`finalmodel_in`
```finalmodel_in = 1x8 logical array 1 0 1 1 0 1 0 1 ```

The default model drops `highway-mpg` (term `8`) from the model and includes `city-mpg` (term `7`) instead. Compare the root mean squared errors (RMSEs) of these two final models.

`stats_def.rmse`
```ans = 3.3033e+03 ```
`stats_in.rmse`
```ans = 3.3324e+03 ```

The model resulting from the default arguments has a slightly lower RMSE. Note that a full specification of the final model consists of the term estimates plus the intercept estimate.

`betahat_def % Term estimates`
```betahat_def = 8×1 103 × 0.1559 -0.2242 -2.8578 0.3904 0.0222 0.0024 -0.2414 0.0793 ```
`stats_def.intercept % Intercept estimate`
```ans = -7.3506e+03 ```

Retrieve the history of the default run of `stepwisefit` and the recommended next step. Omit the display of step information.

```[~,~,~,~,~,nextstep_def,history_def]=stepwisefit(Xstepw,ystepw,'Display',"off"); nextstep_def```
```nextstep_def = 0 ```

No further steps are recommended (`nextstep_def` is `0`).

`history_def.('in') `
```ans = 7x8 logical array 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 0 1 1 1 1 1 0 1 0 1 1 0 1 1 0 ```

The algorithm performs a total of seven steps. The output shows that `engine-size` (term `1`) is added in step `1`, `horsepower` (term `5`) is added in step `2`, and so on.

Input Arguments

collapse all

Predictor variables, specified as an n-by-p numeric matrix, where n is the number of observations and p is the number of predictor variables. Each column of `X` represents one variable, and each row represents one observation.

`stepwisefit` always includes a constant term in the model. Therefore, do not include a column of 1s in `X`.

Data Types: `single` | `double`

Response variable, specified as an n-by-1 numeric or logical vector, where n is the number of observations. Each entry in `y` is the response for the corresponding row of `X`.

Data Types: `single` | `double` | `logical`

Note

`stepwisefit` treats `NaN` values in either `X` or `y` as missing and ignores all rows containing these values.

Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `'PEnter',0.10,'PRemove',0.15,'MaxIter',8` instructs `stepwisefit` to use entry and exit tolerances of `0.10` and `0.15`, respectively, and to take a maximum of `8` steps.

Terms for the initial model, specified as the comma-separated pair consisting of `'InModel'` and a logical vector specifying terms to include in the initial model. The default is to include no terms.

Example: ```'InModel',[true false false true]```

Data Types: `logical`

Tolerance for adding terms to the model, specified as the comma-separated pair consisting of `'PEnter'` and a positive scalar specifying the maximum p-value for a term to be added. The default is 0.05.

Example: `'PEnter',0.10`

Data Types: `single` | `double`

Tolerance for removing terms from the model, specified as the comma-separated pair consisting of` 'PRemove'` and a positive scalar specifying the minimum p-value for a term to be removed. The default is the maximum of `PEnter` and `0.10`.

Note

`PRemove` is not allowed to be smaller than `PEnter` because that would cause `stepwisefit` to enter an infinite loop, where a variable is repeatedly added to the model and removed from the model.

Example: `'PRemove',0.15`

Data Types: `single` | `double`

Indicator for displaying step information, specified as the comma-separated pair consisting of `'Display'` and `'on'` or `'off'`.

• `'on'` displays information about each step in the Command Window (default).

• `'off'` omits the display.

Example: `'Display','off'`

Maximum number of steps, specified as the comma-separated pair consisting of `'MaxIter'` and a positive integer or `Inf` (default). `Inf` allows the algorithm to run until no single step improves the model.

Example: `'MaxIter',12`

Data Types: `double`

Terms to keep in their initial state, specified as the comma-separated pair consisting of `'Keep'` and a logical vector. The value `true` for a term specified to be in (or out of) the initial model forces that term to remain in (or out of) the final model. The value `false` for a term does not force that term to remain in (or out of) the final model. The default is to specify no terms to keep in their initial state.

Example: ```'Keep',[true true false false]```

Data Types: `logical`

Indicator for centering and scaling terms, specified as the comma-separated pair consisting of `'Scale'` and `'off'` or `'on'`.

• `'off'` does not center and scale the terms (default).

• `'on'` centers and scales each column of `X` (computes the z-scores) before fitting.

Example: `'Scale','on'`

Output Arguments

collapse all

Estimated coefficients, returned as a numeric vector corresponding to the terms in `X`. The `stepwisefit` function calculates the values in `b` as follows:

• If a term is included in the final model, then its corresponding value in `b` is the estimate resulting from fitting the final model.

• If a term is excluded from the final model, then its corresponding value in `b` is the estimate resulting from fitting the final model plus that term.

Note

To obtain a full specification of the fitted model, you also need the estimated intercept in addition to `b`. The estimated intercept is provided as a field in the output argument `stats`. For more details see stepwisefit Fitted Model.

Standard errors, returned as a numeric vector corresponding to the estimates in `b`.

p-values, returned as a numeric vector that results from testing whether elements of `b` are `0`.

Final model, returned as a logical vector with length equal to the number of columns in `X`, indicating which terms are in the final model.

Additional statistics, returned as a structure with the following fields. All statistics pertain to the final model except where noted.

FieldDescription
`source`

Character vector `'stepwisefit'`

`dfe`

Degrees of freedom for error

`df0`

Degrees of freedom for the regression

`SStotal`

Total sum of squares of the response

`SSresid`

Sum of squares of the residuals

`fstat`

F-statistic for testing the final model vs. no model (mean only)

`pval`

p-value of the F-statistic

`rmse`

Root mean squared error

`xr`

Residuals for terms not in the final model, computed by subtracting from each term the predicted response of the final model

`yr`

Residuals for the response using predictors in the final model

`B`

Coefficients for terms in the final model, with the value for each term not in the model set to the value that would be obtained by adding that term to the model

`SE`

Standard errors for coefficient estimates

`TSTAT`

t statistics for coefficient estimates

`PVAL`

p-values for coefficient estimates

`intercept`

Estimated intercept

`wasnan`

Rows in the data that contain `NaN` values

Recommended next step, returned as a nonnegative integer equal to the index of the next term to add to or remove from the model, or `0` if no further steps are recommended.

Information on steps taken, returned as a structure with the following fields.

FieldDescription
`B`

Matrix of regression coefficients, where each column is one step and each row is one coefficient vector

`rmse`

Root mean squared errors for the model at each step

`df0`

Degrees of freedom for the regression at each step

`in`

Logical array indicating which predictors are in the model at each step, where each row is one step and each column is one predictor

collapse all

`stepwisefit` Fitted Model

The final `stepwisefit` fitted model is

`$\stackrel{^}{y}=\text{stats}\text{.intercept}+X\left(\text{:,finalmodel}\right)*b\left(\text{finalmodel}\right).$`

Here,

• $\stackrel{^}{y}$ is the predicted mean response.

• `stats.intercept` is the estimated intercept.

• X`(:,finalmodel)` is the input matrix for the terms in the final model.

• `b(finalmodel)` is the vector of coefficient estimates for the terms in the final model.

Algorithms

Stepwise regression is a method for adding terms to and removing terms from a multilinear model based on their statistical significance. This method begins with an initial model and then takes successive steps to modify the model by adding or removing terms. At each step, the p-value of an F-statistic is computed to test models with and without a potential term. If a term is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added to the model. If there is sufficient evidence to reject the null hypothesis, the term is added to the model. Conversely, if a term is currently in the model, the null hypothesis is that the term has a zero coefficient. If there is insufficient evidence to reject the null hypothesis, the term is removed from the model. The method proceeds as follows:

1. Fit the initial model.

2. If any terms not in the model have p-values less than an entry tolerance, add the one with the smallest p-value and repeat this step. For example, assume the initial model is the default constant model and the entry tolerance is the default `0.05`. The algorithm first fits all models consisting of the constant plus another term and identifies the term that has the smallest p-value, for example term `4`. If the term `4` p-value is less than `0.05`, then term `4` is added to the model. Next, the algorithm performs a search among all models consisting of the constant, term `4`, and another term. If a term not in the model has a p-value less than `0.05`, the term with the smallest p-value is added to the model and the process is repeated. When no further terms exist that can be added to the model, the algorithm proceeds to step 3.

3. If any terms in the model have p-values greater than an exit tolerance, remove the one with the largest p-value and go to step 2; otherwise, end.

In each step of the algorithm, `stepwisefit` uses the method of least squares to estimate the model coefficients. After adding a term to the model at an earlier stage, the algorithm might subsequently drop that term if it is no longer helpful in combination with other terms added later. The method terminates when no single step improves the model. However, the final model is not guaranteed to be optimal, which means having the best fit to the data. A different initial model or a different sequence of steps might lead to a better fit. In this sense, stepwise models are locally optimal, but are not necessarily globally optimal.

References

[1] Draper, Norman R., and Harry Smith. Applied Regression Analysis. Hoboken, NJ: Wiley-Interscience, 1998. pp. 307–312.

Version History

Introduced before R2006a