*Feature selection* reduces the dimensionality
of data by selecting only a subset of measured features (predictor
variables) to create a model. Selection criteria usually involve the
minimization of a specific measure of predictive error for models
fit to different subsets. Algorithms search for a subset of predictors
that optimally model measured responses, subject to constraints such
as required or excluded features and the size of the subset.

Feature selection is preferable to feature transformation when the original units and meaning of features are important and the modeling goal is to identify an influential subset. When categorical features are present, and numerical transformations are inappropriate, feature selection becomes the primary means of dimension reduction.

A common method of feature selection is *sequential
feature selection*. This method has two components:

- An objective function, called the *criterion*, which the method seeks to minimize over all feasible feature subsets. Common criteria are mean squared error (for regression models) and misclassification rate (for classification models).
- A sequential search algorithm, which adds or removes features from a candidate subset while evaluating the criterion. Since an exhaustive comparison of the criterion value at all $2^{n}$ subsets of an *n*-feature data set is typically infeasible (depending on the size of *n* and the cost of objective calls), sequential searches move in only one direction, always growing or always shrinking the candidate set.
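To make the cost argument concrete, the following sketch compares the number of subsets an exhaustive search would visit with an upper bound on the number of candidate subsets a one-directional greedy search evaluates (*n* candidates in the first step, *n* − 1 in the second, and so on). The variable names are illustrative, and the bound assumes one criterion evaluation per candidate feature per step.

```matlab
n = 10;                               % number of features
numSubsets = 2^n;                     % all subsets, including the empty set
maxSequentialEvals = n*(n+1)/2;       % n + (n-1) + ... + 1 candidate evaluations
fprintf('Exhaustive: %d subsets, sequential: at most %d evaluations\n', ...
    numSubsets, maxSequentialEvals);
```

For 10 features the difference is modest, but the exhaustive count doubles with every added feature while the sequential bound grows only quadratically.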

The method has two variants:

- *Sequential forward selection* (*SFS*), in which features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion.
- *Sequential backward selection* (*SBS*), in which features are sequentially removed from a full candidate set until the removal of further features increases the criterion.
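The forward variant can be sketched in a few lines of base MATLAB. This is a minimal illustration of the greedy idea, not the implementation used by any toolbox function. The synthetic data, the residual-sum-of-squares criterion `crit`, and the absolute threshold `tol` are all assumptions chosen for the example.

```matlab
rng(0)                                   % reproducible synthetic data
X = randn(200,6);
y = 3*X(:,2) - 2*X(:,5) + 0.1*randn(200,1);   % only columns 2 and 5 matter

% Assumed criterion: residual sum of squares of a least-squares fit
% that includes an intercept term.
addOnes = @(A) [ones(size(A,1),1) A];
crit = @(Xs,y) sum((y - addOnes(Xs)*(addOnes(Xs)\y)).^2);

selected = false(1,size(X,2));           % start from the empty candidate set
best = crit(zeros(size(X,1),0),y);       % criterion of the intercept-only model
tol = 1;                                 % assumed minimum improvement per step

while ~all(selected)
    candidates = find(~selected);
    % Evaluate the criterion with each remaining feature added in turn.
    vals = arrayfun(@(j) crit(X(:,[find(selected) j]),y), candidates);
    [newBest,k] = min(vals);
    if best - newBest <= tol             % no worthwhile improvement: stop
        break
    end
    selected(candidates(k)) = true;      % features are only ever added
    best = newBest;
end
disp(find(selected))
```

Backward selection follows the same pattern with the roles reversed: start with every feature selected and remove, at each step, the feature whose removal raises the criterion least.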

Stepwise regression is a sequential feature selection technique designed specifically for least-squares fitting. The functions `stepwise` and `stepwisefit` make use of optimizations that are only possible with least-squares criteria. Unlike generalized sequential feature selection, stepwise regression may remove features that have been added or add features that have been removed.
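As a brief illustration, the sketch below runs `stepwisefit` on synthetic data in which only two of five predictors affect the response. The data and variable names are assumptions made for the example. The function's default entry and removal p-value thresholds are used, so an irrelevant predictor can occasionally enter the model by chance.

```matlab
rng(2)                                       % reproducible synthetic data
X = randn(120,5);
y = 1.5*X(:,2) - 2*X(:,5) + 0.3*randn(120,1);   % columns 2 and 5 matter

% finalmodel is a logical vector marking the predictors in the final model.
[b,se,pval,finalmodel] = stepwisefit(X,y,'Display','off');
disp(find(finalmodel))
```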

The Statistics and Machine Learning Toolbox™ function `sequentialfs` carries out sequential feature selection. Input arguments include predictor data, response data, and a function handle to a file implementing the criterion function. Optional inputs allow you to specify SFS or SBS, required or excluded features, and the size of the feature subset. The function calls `cvpartition` and `crossval` to evaluate the criterion at different candidate sets.
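The following sketch shows one way these inputs can fit together when cross-validation is used. In that case the criterion function receives training and test data as four arguments. The synthetic data, the helper `addOnes`, and the criterion `critSSE` (sum of squared test-set errors of a least-squares fit) are assumptions made for illustration.

```matlab
rng(1)                                       % reproducible synthetic data
X = randn(150,8);
y = 2*X(:,1) - 3*X(:,6) + 0.2*randn(150,1);  % columns 1 and 6 matter

% Assumed criterion: fit on the training fold, sum squared errors on the test fold.
addOnes = @(A) [ones(size(A,1),1) A];
critSSE = @(Xtr,ytr,Xte,yte) sum((yte - addOnes(Xte)*(addOnes(Xtr)\ytr)).^2);

cvp = cvpartition(size(X,1),'KFold',5);      % 5-fold cross-validation partition

inmodel = sequentialfs(critSSE,X,y, ...
    'cv',cvp, ...                            % how to evaluate the criterion
    'direction','forward', ...               % SFS rather than SBS
    'keepout',8, ...                         % exclude column 8 from the search
    'nfeatures',2);                          % stop at a subset of size 2
disp(find(inmodel))
```

Fixing `'nfeatures'` makes the search stop at a known subset size. Without it, the search stops according to the tolerance settings in the `'options'` structure.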

Consider a data set with 100 observations of 10 predictors. Generate the random data from a logistic model, with a binomial distribution of responses at each set of values for the predictors. Some coefficients are set to zero so that not all of the predictors affect the response.

```
rng(456) % Set the seed for reproducibility
n = 100;
m = 10;
X = rand(n,m);
b = [1 0 0 2 .5 0 0 0.1 0 1];
Xb = X*b';
p = 1./(1+exp(-Xb));
N = 50;
y = binornd(N,p);
```

Fit a logistic model to the data using `fitglm`.

```
Y = [y N*ones(size(y))];
model0 = fitglm(X,Y,'Distribution','binomial')
```

```
model0 = 
Generalized linear regression model:
    logit(y) ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
    Distribution = Binomial

Estimated Coefficients:
                   Estimate       SE        tStat        pValue  
                   _________    _______    ________    __________

    (Intercept)      0.22474    0.30043     0.74806       0.45443
    x1               0.68782    0.17207      3.9973     6.408e-05
    x2                0.2003    0.18087      1.1074       0.26811
    x3             -0.055328    0.18871    -0.29319       0.76937
    x4                2.2576     0.1813      12.452    1.3566e-35
    x5               0.54603    0.16836      3.2432     0.0011821
    x6              0.069701    0.17738     0.39294       0.69437
    x7              -0.22562    0.16957     -1.3306       0.18334
    x8              -0.19712    0.17317     -1.1383       0.25498
    x9              -0.20373    0.16796      -1.213       0.22514
    x10              0.99741    0.17247      5.7832    7.3296e-09

100 observations, 89 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 222, p-value = 4.92e-42
```

Display the deviance of the fit.

```
dev0 = model0.Deviance
```

```
dev0 = 101.5648
```

This model is the full model, with all of the features and an initial constant term. Sequential feature selection searches for a subset of the features in the full model with comparable predictive power.

Before performing feature selection, you must specify a criterion for selecting the features. In this case, the criterion is the deviance of the fit (a generalization of the residual sum of squares). The `critfun` function (shown at the end of this example) calls `fitglm` and returns the deviance of the fit.

If you use the live script file for this example, the `critfun` function is already included at the end of the file. Otherwise, you need to create this function at the end of your .m file or add it as a file on the MATLAB path.

Perform feature selection. `sequentialfs` calls the criterion function via a function handle.

```
maxdev = chi2inv(.95,1);
opt = statset('display','iter',...
              'TolFun',maxdev,...
              'TolTypeFun','abs');

inmodel = sequentialfs(@critfun,X,Y,...
                       'cv','none',...
                       'nullmodel',true,...
                       'options',opt,...
                       'direction','forward');
```

```
Start forward sequential feature selection:
Initial columns included:  none
Columns that can not be included:  none
Step 1, used initial columns, criterion value 323.173
Step 2, added column 4, criterion value 184.794
Step 3, added column 10, criterion value 139.176
Step 4, added column 1, criterion value 119.222
Step 5, added column 5, criterion value 107.281
Final columns included:  1 4 5 10
```

The iterative display shows a decrease in the criterion value as each new feature is added to the model. The final result is a reduced model with only four of the original ten features: columns `1`, `4`, `5`, and `10` of `X`, as indicated in the logical vector `inmodel` returned by `sequentialfs`.

The deviance of the reduced model (107.281) is higher than the deviance of the full model (101.5648). However, the addition of any other single feature would not decrease the criterion value by more than the absolute tolerance, `maxdev`, set in the options structure. Adding a feature with no effect reduces the deviance by an amount that has a chi-square distribution with one degree of freedom. Adding a significant feature results in a larger change in the deviance. By setting `maxdev` to `chi2inv(.95,1)`, you instruct `sequentialfs` to continue adding features provided that the change in deviance is more than the change expected by random chance.
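The stopping rule can be checked by hand against the iterative display. The sketch below copies the criterion values reported at each step into a vector (the variable names are illustrative) and compares each successive drop in deviance with the threshold.

```matlab
% Criterion values copied from the iterative display (steps 1 through 5).
critValues = [323.173 184.794 139.176 119.222 107.281];

drops = -diff(critValues);        % decrease in deviance at each added feature
maxdev = chi2inv(0.95,1);         % 95th percentile of chi-square, 1 d.o.f.

fprintf('Threshold: %.4f\n', maxdev);
disp(drops > maxdev)              % true where the added feature cleared the threshold
```

Every accepted step reduces the deviance by more than the threshold. The search stopped after the fourth feature because no remaining candidate did.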

Create the reduced model with an initial constant term.

```
model = fitglm(X(:,inmodel),Y,'Distribution','binomial')
```

```
model = 
Generalized linear regression model:
    logit(y) ~ 1 + x1 + x2 + x3 + x4
    Distribution = Binomial

Estimated Coefficients:
                    Estimate        SE         tStat        pValue  
                   __________    _______    _________    __________

    (Intercept)    -0.0052025    0.16772    -0.031018       0.97525
    x1                0.73814    0.16316       4.5241    6.0666e-06
    x2                 2.2139    0.17402       12.722    4.4369e-37
    x3                0.54073     0.1568       3.4485    0.00056361
    x4                 1.0694    0.15916       6.7191    1.8288e-11

100 observations, 95 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 216, p-value = 1.44e-45
```
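Because the reduced model is fit to `X(:,inmodel)`, its predictors are relabeled `x1` through `x4` and no longer match the column numbers of the original `X`. The following sketch, with illustrative variable names and the `inmodel` value implied by the iterative display, maps the reduced-model labels back to the original columns.

```matlab
% Logical vector matching "Final columns included: 1 4 5 10".
inmodel = logical([1 0 0 1 1 0 0 0 0 1]);

origCols = find(inmodel);                    % original column indices
reducedNames = "x" + (1:numel(origCols));    % labels used in the reduced model
originalNames = "x" + origCols;              % labels in the full model

disp([reducedNames; originalNames])
```

For example, `x2` in the reduced model corresponds to column 4 of `X`, the predictor with the largest true coefficient.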

This code creates the function `critfun`.

```
function dev = critfun(X,Y)
model = fitglm(X,Y,'Distribution','binomial');
dev = model.Deviance;
end
```