Introduction to Feature Selection

This topic provides an introduction to feature selection algorithms and describes the feature selection functions available in Statistics and Machine Learning Toolbox™.

Feature Selection Algorithms

Feature selection reduces the dimensionality of data by selecting only a subset of measured features (predictor variables) to create a model. Feature selection algorithms search for a subset of predictors that optimally models measured responses, subject to constraints such as required or excluded features and the size of the subset. The main benefits of feature selection are to improve prediction performance, provide faster and more cost-effective predictors, and provide a better understanding of the data generation process [1]. Using too many features can degrade prediction performance even when all features are relevant and contain information about the response variable.

You can categorize feature selection algorithms into three types:

Filter Type Feature Selection — The filter type feature selection algorithm measures feature importance based on the characteristics of the features, such as feature variance and feature relevance to the response. You select important features as part of a data preprocessing step and then train a model using the selected features. Therefore, filter type feature selection is uncorrelated to the training algorithm.
Wrapper Type Feature Selection — The wrapper type feature selection algorithm starts training using a subset of features and then adds or removes a feature using a selection criterion. The selection criterion directly measures the change in model performance that results from adding or removing a feature. The algorithm repeats training and improving a model until its stopping criteria are satisfied.
Embedded Type Feature Selection — The embedded type feature selection algorithm learns feature importance as part of the model learning process. Once you train a model, you obtain the importance of the features in the trained model. This type of algorithm selects features that work well with a particular learning process.

In addition, you can categorize feature selection algorithms according to whether or not an algorithm ranks features sequentially. The minimum redundancy maximum relevance (MRMR) algorithm and stepwise regression are two examples of the sequential feature selection algorithm. For details, see Sequential Feature Selection.

You can compare the importance of predictor variables visually by creating partial dependence plots (PDP) and individual conditional expectation (ICE) plots. For details, see plotPartialDependence.

For classification problems, after selecting features, you can train two models (for example, a full model and a model trained with a subset of predictors) and compare the accuracies of the models by using the compareHoldout, testcholdout, or testckfold functions.

Feature selection is preferable to feature transformation when the original features and their units are important and the modeling goal is to identify an influential subset. When categorical features are present, and numerical transformations are inappropriate, feature selection becomes the primary means of dimension reduction.

Feature Selection Functions

Statistics and Machine Learning Toolbox offers several functions for feature selection. Choose the appropriate feature selection function based on your problem and the data types of the features.

Filter Type Feature Selection

Function	Supported Problem	Supported Data Type	Description
`fscchi2`	Classification	Categorical and continuous features	Examine whether each predictor variable is independent of a response variable by using individual chi-square tests, and then rank features using the p-values of the chi-square test statistics. For examples, see the function reference page `fscchi2`.
`fscmrmr`	Classification	Categorical and continuous features	Rank features sequentially using the Minimum Redundancy Maximum Relevance (MRMR) Algorithm. For examples, see the function reference page `fscmrmr`.
`fscnca`*	Classification	Continuous features	Determine the feature weights by using a diagonal adaptation of neighborhood component analysis (NCA). This algorithm works best for estimating feature importance for distance-based supervised models that use pairwise distances between observations to predict the response. For details, see the function reference page `fscnca` and these topics: Neighborhood Component Analysis (NCA) Feature Selection Tune Regularization Parameter to Detect Features Using NCA for Classification
`fsrftest`	Regression	Categorical and continuous features	Examine the importance of each predictor individually using an F-test, and then rank features using the p-values of the F-test statistics. Each F-test tests the hypothesis that the response values grouped by predictor variable values are drawn from populations with the same mean against the alternative hypothesis that the population means are not all the same. For examples, see the function reference page `fsrftest`.
`fsrmrmr`	Regression	Categorical and continuous features	Rank features sequentially using the Minimum Redundancy Maximum Relevance (MRMR) Algorithm. For examples, see the function reference page `fsrmrmr`.
`fsrnca`*	Regression	Continuous features	Determine the feature weights by using a diagonal adaptation of neighborhood component analysis (NCA). This algorithm works best for estimating feature importance for distance-based supervised models that use pairwise distances between observations to predict the response. For details, see the function reference page `fsrnca` and these topics: Neighborhood Component Analysis (NCA) Feature Selection Robust Feature Selection Using NCA for Regression
`fsulaplacian`	Unsupervised learning	Continuous features	Rank features using the Laplacian Score. For examples, see the function reference page `fsulaplacian`.
`relieff`	Classification and regression	Either all categorical or all continuous features	Rank features using the ReliefF algorithm for classification and the RReliefF algorithm for regression. This algorithm works best for estimating feature importance for distance-based supervised models that use pairwise distances between observations to predict the response. For examples, see the function reference page `relieff`.
`sequentialfs`	Classification and regression	Either all categorical or all continuous features	Select features sequentially using a custom criterion. Define a function that measures the characteristics of data to select features, and pass the function handle to the `sequentialfs` function. You can specify sequential forward selection or sequential backward selection by using the `'Direction'` name-value pair argument. `sequentialfs` evaluates the criterion using cross-validation.

*You can also consider fscnca and fsrnca as embedded type feature selection functions because they return a trained model object and you can use the object functions predict and loss. However, you typically use these object functions to tune the regularization parameter of the algorithm. After selecting features using the fscnca or fsrnca function as part of a data preprocessing step, you can apply another classification or regression algorithm for your problem.

Wrapper Type Feature Selection

Function Supported Problem Supported Data Type Description

Function	Supported Problem	Supported Data Type	Description
`sequentialfs`	Classification and regression	Either all categorical or all continuous features	Select features sequentially using a custom criterion. Define a function that implements a supervised learning algorithm or a function that measures performance of a learning algorithm, and pass the function handle to the `sequentialfs` function. You can specify sequential forward selection or sequential backward selection by using the `'Direction'` name-value pair argument. `sequentialfs` evaluates the criterion using cross-validation. For examples, see the function reference page `sequentialfs` and these topics: Select Subset of Features with Comparative Predictive Power Select Features for Classifying High-Dimensional Data

sequentialfs

Classification and regression

Either all categorical or all continuous features

Select features sequentially using a custom criterion. Define a function that implements a supervised learning algorithm or a function that measures performance of a learning algorithm, and pass the function handle to the sequentialfs function. You can specify sequential forward selection or sequential backward selection by using the 'Direction' name-value pair argument. sequentialfs evaluates the criterion using cross-validation.

For examples, see the function reference page sequentialfs and these topics:

Embedded Type Feature Selection

Function	Supported Problem	Supported Data Type	Description
`DeltaPredictor` property of a `ClassificationDiscriminant` model object	Linear discriminant analysis classification	Continuous features	Create a linear discriminant analysis classifier by using `fitcdiscr`. A trained classifier, returned as `ClassificationDiscriminant`, stores the coefficient magnitude in the `DeltaPredictor` property. You can use the values in `DeltaPredictor` as measures of the predictor importance. This classifier uses the two regularization parameters Gamma and Delta to identify and remove redundant predictors. You can obtain appropriate values for these parameters by using the `cvshrink` function or the `'OptimizeHyperparameters'` name-value pair argument. For examples, see these topics: Regularize Discriminant Analysis Classifier Optimize Discriminant Analysis Model
`fitcecoc` with `templateLinear`	Linear classification for multiclass learning with high-dimensional data	Continuous features	Train a linear classification model by using `fitcecoc` and linear binary learners defined by `templateLinear`. Specify `'Regularization'` of `templatelinear` as `'lasso'` to use lasso regularization. For an example, see Find Good Lasso Penalty Using Cross-Validation. This example determines a good lasso-penalty strength by evaluating models with different strength values using `kfoldLoss`. You can also evaluate models using `kfoldEdge`, `kfoldMargin`, `edge`, `loss`, or `margin`.
`fitclinear`	Linear classification for binary learning with high-dimensional data	Continuous features	Train a linear classification model by using `fitclinear`. Specify `'Regularization'` of `fitclinear` as `'lasso'` to use lasso regularization. For an example, see Find Good Lasso Penalty Using Cross-Validated AUC. This example determines a good lasso-penalty strength by evaluating models with different strength values using the AUC values. Compute the cross-validated posterior class probabilities by using `kfoldPredict`, and compute the AUC values by using `rocmetrics`. You can also evaluate models using `kfoldEdge`, `kfoldLoss`, `kfoldMargin`, `edge`, `loss`, `margin`, or `predict`.
`fitrgp`	Regression	Categorical and continuous features	Train a Gaussian process regression (GPR) model by using `fitrgp`. Set the `KernelFunction` name-value pair argument to use automatic relevance determination (ARD). Available options are `'ardsquaredexponential'`, `'ardexponential'`, `'ardmatern32'`, `'ardmatern52'`, and `'ardrationalquadratic'`. Find the predictor weights by taking the exponential of the negative learned length scales, stored in the `KernelInformation` property. For examples, see these topics: Specify Initial Step Size for LBFGS Optimization Compare NCA and ARD Feature Selection
`fitrlinear`	Linear regression with high-dimensional data	Continuous features	Train a linear regression model by using `fitrlinear`. Specify `'Regularization'` of `fitrlinear` as `'lasso'` to use lasso regularization. For examples, see these topics: Find Good Lasso Penalty Using Regression Loss Find Good Lasso Penalty Using Cross-Validation
`lasso`	Linear regression	Continuous features	Train a linear regression model with Lasso regularization by using `lasso`. You can specify the weight of lasso versus ridge optimization by using the `'Alpha'` name-value pair argument. For examples, see the function reference page `lasso` and these topics: Lasso Regularization Lasso and Elastic Net with Cross Validation Wide Data via Lasso and Parallel Computing
`lassoglm`	Generalized linear regression	Continuous features	Train a generalized linear regression model with Lasso regularization by using `lassoglm`. You can specify the weight of lasso versus ridge optimization by using the `'Alpha'` name-value pair argument. For details, see the function reference page `lassoglm` and these topics: Lasso Regularization of Generalized Linear Models Regularize Poisson Regression Regularize Logistic Regression Regularize Wide Data in Parallel
`oobPermutedPredictorImportance`** of `ClassificationBaggedEnsemble`	Classification with an ensemble of bagged decision trees (for example, random forest)	Categorical and continuous features	Train a bagged classification ensemble with tree learners by using `fitcensemble` and specifying `'Method'` as `'bag'`. Then, use `oobPermutedPredictorImportance` to compute Out-of-Bag, Predictor Importance Estimates by Permutation. The function measures how influential the predictor variables in the model are at predicting the response. For examples, see the function reference page and the topic `oobPermutedPredictorImportance`.
`oobPermutedPredictorImportance`** of `RegressionBaggedEnsemble`	Regression with an ensemble of bagged decision trees (for example, random forest)	Categorical and continuous features	Train a bagged regression ensemble with tree learners by using `fitrensemble` and specifying `'Method'` as `'bag'`. Then, use `oobPermutedPredictorImportance` to compute Out-of-Bag, Predictor Importance Estimates by Permutation. The function measures how influential the predictor variables in the model are at predicting the response. For examples, see the function reference page `oobPermutedPredictorImportance` and Select Predictors for Random Forests.
`permutationImportance` of specific classification and regression models (`Mdl`)	Classification and regression	Categorical and continuous features	Train a classification or regression model. Then, use `permutationImportance` to compute the Permutation Predictor Importance for each model predictor. The function measures how influential the model's predictor variables are in predicting the response. For examples, see the function reference page `permutationImportance`.
`predictorImportance`** of `ClassificationEnsemble`	Classification with an ensemble of decision trees	Categorical and continuous features	Train a classification ensemble with tree learners by using `fitcensemble`. Then, use `predictorImportance` to compute estimates of Predictor Importance for the ensemble by summing changes in the risk due to splits on every predictor and dividing the sum by the number of branch nodes. For examples, see the function reference page `predictorImportance`.
`predictorImportance`** of `ClassificationTree`	Classification with a decision tree	Categorical and continuous features	Train a classification tree by using `fitctree`. Then, use `predictorImportance` to compute estimates of Predictor Importance for the tree by summing changes in the risk due to splits on every predictor and dividing the sum by the number of branch nodes. For examples, see the function reference page `predictorImportance`.
`predictorImportance`** of `RegressionEnsemble`	Regression with an ensemble of decision trees	Categorical and continuous features	Train a regression ensemble with tree learners by using `fitrensemble`. Then, use `predictorImportance` to compute estimates of Predictor Importance for the ensemble by summing changes in the risk due to splits on every predictor and dividing the sum by the number of branch nodes. For examples, see the function reference page `predictorImportance`.
`predictorImportance`** of `RegressionTree`	Regression with a decision tree	Categorical and continuous features	Train a regression tree by using `fitrtree`. Then, use `predictorImportance` to compute estimates of Predictor Importance for the tree by summing changes in the mean squared error (MSE) due to splits on every predictor and dividing the sum by the number of branch nodes. For examples, see the function reference page `predictorImportance`.
`stepwiseglm`***	Generalized linear regression	Categorical and continuous features	Fit a generalized linear regression model using stepwise regression by using `stepwiseglm`. Alternatively, you can fit a linear regression model by using `fitglm` and then adjust the model by using `step`. Stepwise regression is a systematic method for adding and removing terms from the model based on their statistical significance in explaining the response variable. For details, see the function reference page `stepwiseglm` and these topics: Generalized Linear Model Using Stepwise Algorithm Generalized Linear Models Generalized Linear Model Workflow
`stepwiselm`***	Linear regression	Categorical and continuous features	Fit a linear regression model using stepwise regression by using `stepwiselm`. Alternatively, you can fit a linear regression model by using `fitlm` and then adjust the model by using `step`. Stepwise regression is a systematic method for adding and removing terms from the model based on their statistical significance in explaining the response variable. For details, see the function reference page `stepwiselm` and these topics: Stepwise Regression Linear Regression with Interaction Effects Assess Significance of Regression Coefficients Using t-statistic

**For a tree-based algorithm, specify 'PredictorSelection' as 'interaction-curvature' to use the interaction test for selecting the best split predictor. The interaction test is useful in identifying important variables in the presence of many irrelevant variables. Also, if the training data includes many predictors, then specify 'NumVariablesToSample' as 'all' for training. Otherwise, the software might not select some predictors, underestimating their importance. For details, see fitctree, fitrtree, and templateTree.

***stepwiseglm and stepwiselm are not wrapper type functions because you cannot use them as a wrapper for another training function. However, these two functions use the wrapper type algorithm to find important features.

References

[1] Guyon, Isabelle, and A. Elisseeff. "An introduction to variable and feature selection." Journal of Machine Learning Research. Vol. 3, 2003, pp. 1157–1182.