Feature selection reduces the dimensionality of data by selecting only a subset of measured features (predictor variables) to create a model. Selection criteria usually involve the minimization of a specific measure of predictive error for models fit to different subsets. Algorithms search for a subset of predictors that optimally models the measured responses, subject to constraints such as required or excluded features and the size of the subset.
Feature selection is preferable to feature transformation when the original units and meaning of features are important and the modeling goal is to identify an influential subset. When categorical features are present, and numerical transformations are inappropriate, feature selection becomes the primary means of dimension reduction.
Sequential feature selection is a common approach. The method has two components:
An objective function, called the criterion, which the method seeks to minimize over all feasible feature subsets. Common criteria are mean squared error (for regression models) and misclassification rate (for classification models).
A sequential search algorithm, which adds or removes features from a candidate subset while evaluating the criterion. Since an exhaustive comparison of the criterion value at all 2^n subsets of an n-feature data set is typically infeasible (depending on the size of n and the cost of objective calls), sequential searches move in only one direction, always growing or always shrinking the candidate set.
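To make the cost argument concrete, the following sketch compares the number of candidate subsets with the worst-case number of criterion evaluations in a forward search that keeps adding features until none remain. The value n = 10 is chosen only for illustration, and the count ignores any evaluation of the empty model.

```matlab
n = 10;                          % number of features (illustrative value)
nSubsets = 2^n;                  % every possible subset, including the empty set

% A forward search with k features already selected evaluates the criterion
% once for each of the n-k remaining candidates, so a search that runs to
% completion makes n + (n-1) + ... + 1 evaluations.
nSequential = n*(n+1)/2;

fprintf('%d subsets, at most %d sequential evaluations\n', nSubsets, nSequential);
```

The gap widens quickly: the subset count doubles with every added feature, while the sequential count grows only quadratically.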
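The method has two variants:
Sequential forward selection (SFS), in which features are added one at a time to an initially empty candidate set until adding further features does not decrease the criterion.
Sequential backward selection (SBS), in which features are removed one at a time from an initially full candidate set until removing further features increases the criterion.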
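A forward search can be sketched in a few lines of base MATLAB. This is a minimal illustration, not how sequentialfs is implemented: the synthetic data, the mean-squared-error criterion crit, the helper design, and the stopping tolerance tol are all assumptions made for the example.

```matlab
rng(0);                                   % reproducible synthetic data
nObs = 200;
X = randn(nObs,6);
y = 3*X(:,2) - 2*X(:,5) + 0.1*randn(nObs,1);   % only columns 2 and 5 matter

% Criterion: mean squared error of a least-squares fit with an intercept.
design = @(cols) [ones(nObs,1) X(:,cols)];
crit = @(cols) mean((y - design(cols)*(design(cols)\y)).^2);

selected = [];                            % forward search starts from the empty set
remaining = 1:size(X,2);
best = crit(selected);                    % intercept-only model
tol = 0.05;                               % hypothetical absolute tolerance

while ~isempty(remaining)
    % Evaluate the criterion with each remaining feature added in turn.
    vals = arrayfun(@(j) crit([selected j]), remaining);
    [newBest,k] = min(vals);
    if best - newBest < tol
        break                             % no candidate improves the fit enough
    end
    selected = [selected remaining(k)];   % grow the set; features are never removed
    remaining(k) = [];
    best = newBest;
end
disp(selected)
```

A backward search would instead start with selected = 1:size(X,2) and drop, at each step, the feature whose removal increases the criterion least.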
Stepwise regression is a sequential feature selection technique designed specifically for least-squares fitting. The functions stepwise and stepwisefit make use of optimizations that are only possible with least-squares criteria. Unlike generalized sequential feature selection, stepwise regression may remove features that have been added or add features that have been removed.
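As a rough illustration of the least-squares case, the sketch below runs stepwisefit on synthetic data in which only two of six predictors affect the response. The data and the choice to suppress the display are assumptions for the example; the default entry and removal thresholds are left unchanged.

```matlab
rng(0);                                   % reproducible synthetic data
X = randn(200,6);
y = 3*X(:,2) - 2*X(:,5) + 0.1*randn(200,1);   % only columns 2 and 5 matter

% 'display','off' suppresses the step-by-step report.
[b,se,pval,inmodel] = stepwisefit(X,y,'display','off');

find(inmodel)                             % indices of the terms kept in the model
```

Columns 2 and 5 should be retained. A noise column can occasionally enter as well, because each entry decision is a significance test.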
The Statistics Toolbox™ function sequentialfs carries out sequential feature selection. Input arguments include predictor and response data and a function handle to a file implementing the criterion function. Optional inputs allow you to specify SFS or SBS, required or excluded features, and the size of the feature subset. The function calls cvpartition and crossval to evaluate the criterion at different candidate sets.
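The following sketch shows one way these inputs might fit together for a regression problem with cross-validation. The synthetic data, the helper addOnes, and the criterion handle sse are assumptions for the example. When cross-validation is used, the criterion is called with training and test sets rather than with a single data set.

```matlab
rng(0);                                   % reproducible synthetic data
nObs = 200;
X = randn(nObs,6);
y = 3*X(:,2) - 2*X(:,5) + 0.1*randn(nObs,1);   % only columns 2 and 5 matter

% Criterion: sum of squared test errors for a least-squares fit (with an
% intercept) estimated on the training set.
addOnes = @(A) [ones(size(A,1),1) A];
sse = @(Xtrain,ytrain,Xtest,ytest) ...
    sum((ytest - addOnes(Xtest)*(addOnes(Xtrain)\ytrain)).^2);

c = cvpartition(nObs,'kfold',5);          % 5-fold partition of the observations
inmodel = sequentialfs(sse,X,y, ...
    'cv',c, ...
    'direction','forward', ...            % SFS; 'backward' requests SBS
    'nfeatures',2);                       % stop at a two-feature subset

find(inmodel)
```

Required and excluded features can be specified in the same call with the 'keepin' and 'keepout' parameters.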
For example, consider a data set with 100 observations of 10 predictors. The following generates random data from a logistic model, with a binomial distribution of responses at each set of values for the predictors. Some coefficients are set to zero so that not all of the predictors affect the response:
    n = 100;
    m = 10;
    X = rand(n,m);
    b = [1 0 0 2 .5 0 0 0.1 0 1];
    Xb = X*b';
    p = 1./(1+exp(-Xb));
    N = 50;
    y = binornd(N,p);
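A quick sanity check on the simulated data can be useful before fitting. The sketch below repeats the generation step so that it runs on its own (the rng call is an addition for reproducibility) and then inspects the success probabilities and counts. Because the predictors lie between 0 and 1, the coefficients are nonnegative, and there is no intercept, every linear predictor is nonnegative and every probability is at least 0.5.

```matlab
rng(0);                                   % added for reproducibility
n = 100;
m = 10;
X = rand(n,m);
b = [1 0 0 2 .5 0 0 0.1 0 1];
Xb = X*b';                                % linear predictor, one value per observation
p = 1./(1+exp(-Xb));                      % logistic success probabilities
N = 50;                                   % trials per observation
y = binornd(N,p);                         % number of successes out of N

fprintf('p ranges from %.3f to %.3f\n', min(p), max(p));
fprintf('y ranges from %d to %d out of %d trials\n', min(y), max(y), N);
```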
The glmfit function fits a logistic model to the data:
    Y = [y N*ones(size(y))];
    [b0,dev0,stats0] = glmfit(X,Y,'binomial');

    % Display coefficient estimates and their standard errors:
    model0 = [b0 stats0.se]
    model0 =
        0.3115    0.2596
        0.9614    0.1656
       -0.1100    0.1651
       -0.2165    0.1683
        1.9519    0.1809
        0.5683    0.2018
       -0.0062    0.1740
        0.0651    0.1641
       -0.1034    0.1685
        0.0017    0.1815
        0.7979    0.1806

    % Display the deviance of the fit:
    dev0
    dev0 =
      101.2594
This is the full model, using all of the features (and an initial constant term). Sequential feature selection searches for a subset of the features in the full model with comparable predictive power.
First, you must specify a criterion for selecting the features. The following function, which calls glmfit and returns the deviance of the fit (a generalization of the residual sum of squares), is a useful criterion in this case:
    function dev = critfun(X,Y)
    % Criterion for sequentialfs: deviance of a binomial (logistic) fit.
    [b,dev] = glmfit(X,Y,'binomial');
You should create this function as a file on the MATLAB® path.
The function sequentialfs performs feature selection, calling the criterion function via a function handle:
    maxdev = chi2inv(.95,1);
    opt = statset('display','iter',...
                  'TolFun',maxdev,...
                  'TolTypeFun','abs');

    inmodel = sequentialfs(@critfun,X,Y,...
                           'cv','none',...
                           'nullmodel',true,...
                           'options',opt,...
                           'direction','forward');

    Start forward sequential feature selection:
    Initial columns included:  none
    Columns that can not be included:  none
    Step 1, used initial columns, criterion value 309.118
    Step 2, added column 4, criterion value 180.732
    Step 3, added column 1, criterion value 138.862
    Step 4, added column 10, criterion value 114.238
    Step 5, added column 5, criterion value 103.503
    Final columns included:  1 4 5 10
The iterative display shows a decrease in the criterion value as each new feature is added to the model. The final result is a reduced model with only four of the original ten features: columns 1, 4, 5, and 10 of X. These features are indicated in the logical vector inmodel returned by sequentialfs.
The deviance of the reduced model is higher than that of the full model, but the addition of any other single feature would not decrease the criterion by more than the absolute tolerance, maxdev, set in the options structure. Adding a feature with no effect reduces the deviance by an amount that has approximately a chi-square distribution with one degree of freedom. Adding a significant feature results in a larger change. By setting maxdev to chi2inv(.95,1), you instruct sequentialfs to continue adding features so long as the change in deviance is more than would be expected by random chance.
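The stopping rule can be checked by hand against the criterion values reported in the iterative display. The sketch below copies those values and compares each decrease in deviance with the tolerance. The final comparison of the reduced and full models against a six-degree-of-freedom cutoff is an added rough check, not part of what sequentialfs computes.

```matlab
% Criterion values copied from the iterative display.
crit = [309.118 180.732 138.862 114.238 103.503];
drops = -diff(crit);                 % deviance decrease for each added column
maxdev = chi2inv(0.95,1);            % tolerance, about 3.84

disp(drops)
disp(all(drops > maxdev))            % every accepted step exceeds the tolerance

% Full model (dev0 = 101.2594) versus reduced model (103.503): the six
% omitted features lower the deviance by only about 2.24 in total.
extra = 103.503 - 101.2594;
disp(extra < chi2inv(0.95,6))        % rough check against a 6-df cutoff (about 12.59)
```

The smallest accepted decrease, about 10.7 for column 5, is still well above the tolerance.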
The reduced model (also with an initial constant term) is:
    [b,dev,stats] = glmfit(X(:,inmodel),Y,'binomial');

    % Display coefficient estimates and their standard errors:
    model = [b stats.se]
    model =
        0.0784    0.1642
        1.0040    0.1592
        1.9459    0.1789
        0.6134    0.1872
        0.8245    0.1730