Regularization techniques are used to prevent statistical overfitting in a predictive model. Regularization algorithms typically work by adding a penalty for complexity to the fitting objective, for example by including the size of the model coefficients in the quantity being minimized or by including a roughness penalty. By introducing this additional information into the model, regularization algorithms can handle multicollinearity and redundant predictors, making the model more parsimonious and often more accurate.
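As a sketch of what "adding a penalty" means for a linear model, the penalized fit can be written as below. The notation is an assumption made here for illustration and is not taken from the text above: $y$ is the response vector, $X$ the $n \times p$ predictor matrix, $\beta$ the coefficient vector, $\lambda \ge 0$ the penalty strength, and $P$ the penalty function.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \, P(\beta),
\qquad
P(\beta) =
\begin{cases}
\lVert \beta \rVert_1 = \sum_{j=1}^{p} \lvert \beta_j \rvert & \text{(L1 penalty)} \\[4pt]
\tfrac{1}{2} \lVert \beta \rVert_2^2 = \tfrac{1}{2} \sum_{j=1}^{p} \beta_j^2 & \text{(L2 penalty)}
\end{cases}
```

Setting $\lambda = 0$ recovers ordinary least squares, and larger values of $\lambda$ shrink the coefficients more strongly. The scaling constants ($\tfrac{1}{2n}$ and $\tfrac{1}{2}$) are one common convention; software packages differ on them.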
Popular regularization techniques include ridge regression (also known as Tikhonov regularization), lasso and elastic net algorithms, and the method of shrunken centroids. Trace plots and cross-validated mean squared error are commonly used alongside these techniques to choose the amount of regularization. You can also apply the Akaike information criterion (AIC) as a goodness-of-fit metric.
Each regularization technique offers advantages for certain use cases.
- Lasso uses an L1 norm and tends to force individual coefficient values exactly to zero. As a result, lasso works very well as a feature selection algorithm. It quickly identifies a small number of key variables.
- Ridge regression uses an L2 norm for the coefficients (you're minimizing the sum of the squared errors plus a penalty on the sum of the squared coefficients). Ridge regression tends to spread coefficient shrinkage across a larger number of coefficients. If you think that your model should contain a large number of coefficients, ridge regression is probably a good technique.
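- Elastic net combines the L1 and L2 penalties. It can compensate for lasso’s inability to identify additional predictors: when there are more predictors than observations, lasso can select at most as many predictors as there are observations.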
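The different shrinkage behavior of the three penalties can be sketched in the simplest possible setting: a single standardized predictor (or an orthonormal design), where each penalized coefficient has a closed form in terms of the unpenalized least-squares coefficient `b`. The function names, the example coefficients, and the parameters `lam` (penalty strength) and `alpha` (L1/L2 mixing weight) are assumptions chosen for illustration, not part of any particular library.

```python
def soft_threshold(value, threshold):
    """Move value toward zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_shrink(b, lam):
    # Minimizer of 0.5*(b - beta)**2 + lam*abs(beta)
    return soft_threshold(b, lam)


def ridge_shrink(b, lam):
    # Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2
    return b / (1.0 + lam)


def elastic_net_shrink(b, lam, alpha):
    # Minimizer of 0.5*(b - beta)**2 + lam*(alpha*abs(beta) + 0.5*(1 - alpha)*beta**2)
    return soft_threshold(b, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical unpenalized coefficients: two large, two small.
ols = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

lasso = [lasso_shrink(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]
enet = [elastic_net_shrink(b, lam, alpha=0.5) for b in ols]

print("lasso      ", [round(b, 3) for b in lasso])
print("ridge      ", [round(b, 3) for b in ridge])
print("elastic net", [round(b, 3) for b in enet])
```

In this toy case lasso sets the two small coefficients exactly to zero, while ridge scales all four coefficients by the same factor and leaves every one of them nonzero. Real problems with correlated predictors do not have these closed forms, so this is only an illustration of the qualitative difference.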
Regularization is related to feature selection in that it forces a model to use fewer predictors. Regularization methods have some distinct advantages.
- Regularization techniques are able to operate on much larger datasets than most feature selection methods (except for univariate feature selection). Lasso and ridge regression can be applied to datasets that contain thousands, even tens of thousands, of variables.
- Regularization algorithms often generate more accurate predictive models than feature selection. Regularization operates over a continuous space while feature selection operates over a discrete space. As a result, regularization is often able to fine-tune the model and produce more accurate estimates.
However, feature selection methods also have advantages:
- Feature selection is somewhat more intuitive and easier to explain to third parties. This is valuable when you have to describe your methods when sharing your results.
- MATLAB® and Statistics and Machine Learning Toolbox™ support all popular regularization techniques, and regularization is available for linear regression, logistic regression, support vector machines, and linear discriminant analysis. If you're working with other model types, such as boosted decision trees, you need to apply feature selection.
In summary:
- Regularization is used (alongside feature selection) to prevent statistical overfitting in a predictive model.
- Because regularization operates over a continuous space, it can outperform discrete feature selection for machine learning problems that lend themselves to various kinds of linear modeling.
Let's assume that you are running a cancer research study. You have gene sequences for 500 different cancer patients and you're trying to determine which of 15,000 different genes have a significant impact on the progression of the disease. You could apply one of the feature ranking methods, such as minimum redundancy maximum relevance or neighborhood component analysis, or univariate feature selection if you’re concerned about runtime; only sequential feature selection is completely impractical with this many variables. Alternatively, you can explore models with regularization. You can't use ridge regression because it won't force coefficients completely to zero quickly enough. At the same time, you can't use lasso, since you might need to identify more than 500 different genes and lasso can select at most as many predictors as there are observations. The elastic net is one possible solution.
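A scaled-down version of this scenario can be sketched with a small coordinate-descent solver. Everything here is an assumption for illustration: the data are synthetic (5 observations and 20 predictors stand in for 500 patients and 15,000 genes), the function `elastic_net_cd` is a hypothetical helper rather than a library routine, and the objective uses the common parameterization with penalty strength `lam` and mixing weight `alpha` (`alpha=0` is ridge, `alpha=1` is lasso).

```python
import random


def soft_threshold(value, threshold):
    """Move value toward zero by threshold; return exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_sweeps=200):
    """Cyclic coordinate descent for
    (1/(2n))*||y - X b||^2 + lam*(alpha*||b||_1 + 0.5*(1 - alpha)*||b||_2^2)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of column j with the residual that excludes beta[j].
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha)
            new_bj /= col_sq[j] + lam * (1.0 - alpha)
            delta = new_bj - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def count_nonzero(beta):
    return sum(1 for b in beta if b != 0.0)


# Synthetic "wide" data: far more predictors than observations.
random.seed(0)
n, p = 5, 20
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0.0, 0.1) for row in X]

fits = {}
for name, alpha in [("ridge", 0.0), ("elastic net", 0.5), ("lasso", 1.0)]:
    fits[name] = elastic_net_cd(X, y, lam=0.1, alpha=alpha)
    print(f"{name:12s} nonzero coefficients: {count_nonzero(fits[name])} of {p}")
```

The printed counts make the trade-off concrete on this toy data: the ridge fit keeps every coefficient nonzero, while the L1 part of the other two penalties can set coefficients exactly to zero. The values of `lam`, `alpha`, and `n_sweeps` are arbitrary here; in practice they would be tuned, for example with cross-validation.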