Lasso is a regularization technique. Use `lasso` to:

- Reduce the number of predictors in a regression model.
- Identify important predictors.
- Select among redundant predictors.
- Produce shrinkage estimates with potentially lower predictive errors than ordinary least squares.

Elastic net is a related technique. Use elastic net when you have several highly correlated variables. `lasso` provides elastic net regularization when you set the `Alpha` name-value pair to a number strictly between `0` and `1`.

See Lasso and Elastic Net Details.

For lasso regularization of regression ensembles, see `regularize`.

To see how `lasso` identifies and discards unnecessary predictors:

Generate 200 samples of five-dimensional artificial data `X` from exponential distributions with various means:

```matlab
rng(3,'twister')  % for reproducibility
X = zeros(200,5);
for ii = 1:5
    X(:,ii) = exprnd(ii,200,1);
end
```

Generate response data `Y = X*r + eps`, where `r` has just two nonzero components and the noise `eps` is normal with standard deviation `0.1`:

```matlab
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
```

Fit a cross-validated sequence of models with `lasso`, and plot the result:

```matlab
[b,fitinfo] = lasso(X,Y,'CV',10);
lassoPlot(b,fitinfo,'PlotType','Lambda','XScale','log');
```

The plot shows the nonzero coefficients in the regression for various values of the `Lambda` regularization parameter. Larger values of `Lambda` appear on the left side of the graph, meaning more regularization, resulting in fewer nonzero regression coefficients.

The dashed vertical lines represent the `Lambda` value with minimal mean squared error (on the right), and the `Lambda` value with minimal mean squared error plus one standard deviation. This latter value is a recommended setting for `Lambda`. These lines appear only when you perform cross validation. Cross validate by setting the `'CV'` name-value pair. This example uses 10-fold cross validation.

The upper part of the plot shows the degrees of freedom (df), meaning the number of nonzero coefficients in the regression, as a function of `Lambda`. On the left, the large value of `Lambda` causes all but one coefficient to be 0. On the right all five coefficients are nonzero, though the plot shows only two clearly. The other three coefficients are so small that you cannot visually distinguish them from 0.

For small values of `Lambda` (toward the right in the plot), the coefficient values are close to the least-squares estimate, which is computed below for comparison.

Find the `Lambda` value of the minimal cross-validated mean squared error plus one standard deviation. Examine the MSE and coefficients of the fit at that `Lambda`:

```matlab
lam = fitinfo.Index1SE;
fitinfo.MSE(lam)

ans =
    0.1398

b(:,lam)

ans =
         0
    1.8855
         0
   -2.9367
         0
```

`lasso` did a good job finding the coefficient vector `r`.

For comparison, find the least-squares estimate of `r`:

```matlab
rhat = X\Y

rhat =
   -0.0038
    1.9952
    0.0014
   -2.9993
    0.0031
```

The estimate `b(:,lam)` has a larger mean squared error on the training data than the least-squares estimate `rhat`:

```matlab
res = X*rhat - Y;       % calculate residuals
MSEmin = res'*res/200   % compare to the 0.1398 MSE of b(:,lam)

MSEmin =
    0.0088
```

But `b(:,lam)` has only two nonzero components, and therefore can provide better predictive estimates on new data.

Consider predicting the mileage (`MPG`) of a car based on its weight, displacement, horsepower, and acceleration. The `carbig` data contains these measurements. The data seem likely to be correlated, making elastic net an attractive choice.

Load the data:

```matlab
load carbig
```

Extract the continuous (noncategorical) predictors (`lasso` does not handle categorical predictors):

```matlab
X = [Acceleration Displacement Horsepower Weight];
```

Perform a lasso fit with 10-fold cross validation:

```matlab
[b,fitinfo] = lasso(X,MPG,'CV',10);
```

Plot the result:

```matlab
lassoPlot(b,fitinfo,'PlotType','Lambda','XScale','log');
```

Calculate the correlation of the predictors:

```matlab
% Eliminate NaNs so corr runs
nonan = ~any(isnan([X MPG]),2);
Xnonan = X(nonan,:);
MPGnonan = MPG(nonan,:);
corr(Xnonan)

ans =
    1.0000   -0.5438   -0.6892   -0.4168
   -0.5438    1.0000    0.8973    0.9330
   -0.6892    0.8973    1.0000    0.8645
   -0.4168    0.9330    0.8645    1.0000
```

Because some predictors are highly correlated, perform elastic net fitting. Use `Alpha = 0.5`:

```matlab
[ba,fitinfoa] = lasso(X,MPG,'CV',10,'Alpha',.5);
```

Plot the result. Name each predictor so you can tell which curve is which:

```matlab
pnames = {'Acceleration','Displacement',...
    'Horsepower','Weight'};
lassoPlot(ba,fitinfoa,'PlotType','Lambda',...
    'XScale','log','PredictorNames',pnames);
```

When you activate the data cursor and click the plot, you see the name of the predictor, the coefficient, the value of `Lambda`, and the index of that point, meaning the column in `b` associated with that fit.

Here, the elastic net and lasso results are not very similar. Also, the elastic net plot reflects a notable qualitative property of the elastic net technique. The elastic net retains three nonzero coefficients as `Lambda` increases (toward the left of the plot), and these three coefficients reach `0` at about the same `Lambda` value. In contrast, the lasso plot shows two of the three coefficients becoming `0` at the same value of `Lambda`, while another coefficient remains nonzero for higher values of `Lambda`.

This behavior exemplifies a general pattern. In general, elastic net tends to retain or drop groups of highly correlated predictors as `Lambda` increases. In contrast, lasso tends to drop smaller groups, or even individual predictors.
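This group behavior can be seen on synthetic data containing one highly correlated pair of predictors; the following is a minimal sketch, not part of the example above (the data and the `Lambda` grid are illustrative, and `lasso` requires Statistics and Machine Learning Toolbox):

```matlab
% Two nearly identical predictors plus one independent predictor.
rng default
z = randn(100,1);
X = [z, z + 0.01*randn(100,1), randn(100,1)];
Y = X*[1;1;1] + randn(100,1)*0.1;

% Fit lasso and elastic net over a shared Lambda sequence.
lambdas = logspace(-2,0,20);
bl = lasso(X,Y,'Lambda',lambdas);              % lasso (Alpha = 1)
be = lasso(X,Y,'Lambda',lambdas,'Alpha',0.5);  % elastic net

% Count nonzero coefficients at each Lambda.
sum(bl ~= 0)
sum(be ~= 0)
```

Inspecting which rows of `bl` and `be` are nonzero column by column typically shows the elastic net keeping or dropping the correlated pair together, while the lasso drops one member of the pair first.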

Lasso and elastic net are especially well suited to *wide* data, meaning data with more predictors than observations. Such data necessarily contain redundant predictors, because the predictors cannot all be linearly independent. Use `lasso` along with cross validation to identify important predictors.

Cross validation can be slow. If you have a Parallel Computing Toolbox™ license, speed the computation using parallel computing.

Load the `spectra` data:

```matlab
load spectra
Description

Description =

== Spectral and octane data of gasoline ==

NIR spectra and octane numbers of 60 gasoline samples

NIR:      NIR spectra, measured in 2 nm intervals from 900 nm to 1700 nm
octane:   octane numbers
spectra:  a dataset array containing variables for NIR and octane

Reference:
Kalivas, John H., "Two Data Sets of Near Infrared Spectra," Chemometrics
and Intelligent Laboratory Systems, v.37 (1997) pp.255-259
```

Compute the default `lasso` fit:

```matlab
[b,fitinfo] = lasso(NIR,octane);
```

Plot the number of predictors in the fitted lasso regularization as a function of `Lambda`, using a logarithmic *x*-axis:

```matlab
lassoPlot(b,fitinfo,'PlotType','Lambda','XScale','log');
```

It is difficult to tell which value of `Lambda` is appropriate. To determine a good value, try fitting with cross validation:

```matlab
tic
[b,fitinfo] = lasso(NIR,octane,'CV',10);  % a time-consuming operation
toc

Elapsed time is 226.876926 seconds.
```

Plot the result:

```matlab
lassoPlot(b,fitinfo,'PlotType','Lambda','XScale','log');
```

You can see the suggested value of `Lambda` is over `1e-2`, and the `Lambda` with minimal MSE is under `1e-2`. These values are in the `fitinfo` structure:

```matlab
fitinfo.LambdaMinMSE

ans =
    0.0057

fitinfo.Lambda1SE

ans =
    0.0190
```

Examine the quality of the fit for the suggested value of `Lambda`:

```matlab
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)

ans =
    0.0532

fitinfo.DF(lambdaindex)

ans =
    11
```

The fit uses just 11 of the 401 predictors, and achieves a cross-validated MSE of `0.0532`.

Examine the plot of cross-validated MSE:

```matlab
lassoPlot(b,fitinfo,'PlotType','CV');
% Use a log scale for MSE to see small MSE values better
set(gca,'YScale','log');
```

As `Lambda` increases (toward the left), MSE increases rapidly. The coefficients are reduced too much, and they do not adequately fit the responses.

As `Lambda` decreases, the models are larger (have more nonzero coefficients). The increasing MSE suggests that the models are overfitted.

The default set of `Lambda` values does not include values small enough to include all predictors. In this case, there does not appear to be a reason to look at smaller values. However, if you want smaller values than the default, use the `LambdaRatio` parameter, or supply a sequence of `Lambda` values using the `Lambda` parameter. For details, see the `lasso` reference page.

To compute the cross-validated lasso estimate faster, use parallel computing (available with a Parallel Computing Toolbox license):

```matlab
parpool()

Starting parpool using the 'local' profile ... connected to 2 workers.

ans =

 Pool with properties:

    AttachedFiles: {0x1 cell}
       NumWorkers: 2
      IdleTimeout: 30
          Cluster: [1x1 parallel.cluster.Local]
     RequestQueue: [1x1 parallel.RequestQueue]
      SpmdEnabled: 1

opts = statset('UseParallel',true);
tic;
[b,fitinfo] = lasso(NIR,octane,'CV',10,'Options',opts);
toc

Elapsed time is 114.712260 seconds.
```

Computing in parallel using two workers is faster on this problem.

Lasso is a regularization technique for performing linear regression.
Lasso includes a penalty term that constrains the size of the estimated
coefficients. Therefore, it resembles ridge regression. Lasso
is a *shrinkage estimator*: it generates coefficient
estimates that are biased to be small. Nevertheless, a lasso estimator
can have smaller mean squared error than an ordinary least-squares
estimator when you apply it to new data.

Unlike ridge regression, as the penalty term increases, lasso sets more coefficients to zero. This means that the lasso estimator is a smaller model, with fewer predictors. As such, lasso is an alternative to stepwise regression and other model selection and dimensionality reduction techniques.
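This contrast can be illustrated on simulated data where only two of five predictors matter; the following is a minimal sketch with illustrative data (assuming Statistics and Machine Learning Toolbox for `ridge` and `lasso`):

```matlab
% Compare ridge and lasso on data where only two of five predictors matter.
rng default
X = randn(100,5);
Y = X*[0;2;0;-3;0] + randn(100,1)*0.1;

bridge = ridge(Y,X,1);          % ridge shrinks, but rarely to exactly 0
blasso = lasso(X,Y,'Lambda',1); % lasso sets some coefficients to exactly 0

nnz(bridge)   % typically 5: all coefficients remain nonzero
nnz(blasso)   % fewer: lasso zeroes the unimportant predictors
```

The ridge estimate keeps all five predictors with small coefficients, while the lasso estimate drops the predictors that do not contribute to `Y`.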

Elastic net is a related technique. Elastic net is a hybrid of ridge regression and lasso regularization. Like lasso, elastic net can generate reduced models by generating zero-valued coefficients. Empirical studies have suggested that the elastic net technique can outperform lasso on data with highly correlated predictors.

The *lasso* technique solves this regularization problem. For a given value of *λ*, a nonnegative parameter, `lasso` solves the problem

$$\underset{{\beta}_{0},\beta}{\mathrm{min}}\left(\frac{1}{2N}{\displaystyle \sum _{i=1}^{N}{\left({y}_{i}-{\beta}_{0}-{x}_{i}^{T}\beta \right)}^{2}}+\lambda {\displaystyle \sum _{j=1}^{p}\left|{\beta}_{j}\right|}\right),$$

where

- *N* is the number of observations.
- *y*_{i} is the response at observation *i*.
- *x*_{i} is data, a vector of *p* values at observation *i*.
- *λ* is a positive regularization parameter corresponding to one value of `Lambda`.
- The parameters *β*_{0} and *β* are a scalar and a *p*-vector, respectively.

As *λ* increases, the number of nonzero
components of *β* decreases.

The lasso problem involves the *L*^{1} norm
of *β*, as contrasted with the elastic net
algorithm.
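For a concrete sense of the objective, it can be evaluated directly for any candidate intercept and coefficient vector; the following is a minimal sketch (the function handle and the tiny data set are illustrative, not part of the examples above):

```matlab
% Evaluate the lasso objective for intercept b0, coefficient vector
% beta, and regularization parameter lambda.
lassoObjective = @(X,y,b0,beta,lambda) ...
    sum((y - b0 - X*beta).^2)/(2*numel(y)) + lambda*sum(abs(beta));

% Tiny illustrative data set (N = 3 observations, p = 2 predictors)
X = [1 2; 3 4; 5 6];
y = [1; 2; 3];

% With lambda = 0 the objective is the mean squared residual over 2;
% increasing lambda adds the L1 penalty on beta (not on the intercept).
lassoObjective(X,y,0,[0.5;0],0)     % = 0.125
lassoObjective(X,y,0,[0.5;0],0.1)   % = 0.175, adding 0.1*|0.5| = 0.05
```

`lasso` minimizes this objective over *β*_{0} and *β* for each value of `Lambda` it considers.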

The *elastic net* technique solves this
regularization problem. For an *α* strictly
between 0 and 1, and a nonnegative *λ*, elastic
net solves the problem

$$\underset{{\beta}_{0},\beta}{\mathrm{min}}\left(\frac{1}{2N}{\displaystyle \sum _{i=1}^{N}{\left({y}_{i}-{\beta}_{0}-{x}_{i}^{T}\beta \right)}^{2}}+\lambda {P}_{\alpha}\left(\beta \right)\right),$$

where

$${P}_{\alpha}\left(\beta \right)=\frac{(1-\alpha )}{2}{\Vert \beta \Vert}_{2}^{2}+\alpha {\Vert \beta \Vert}_{1}={\displaystyle \sum _{j=1}^{p}\left(\frac{(1-\alpha )}{2}{\beta}_{j}^{2}+\alpha \left|{\beta}_{j}\right|\right)}.$$

Elastic net is the same as lasso when *α* = 1. As *α* shrinks toward 0, elastic net approaches `ridge` regression. For other values of *α*, the penalty term *P*_{α}(*β*) interpolates between the *L*^{1} norm of *β* and the squared *L*^{2} norm of *β*.
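The two forms of *P*_{α}(*β*) in the equation above, the norm form and the elementwise sum, are algebraically identical, which is easy to confirm numerically; a minimal sketch with illustrative values:

```matlab
% Elastic net penalty written both ways: as a combination of the
% squared L2 norm and the L1 norm, and as an elementwise sum.
alpha = 0.5;
beta  = [0; 2; 0; -3; 0];

Pnorm = (1-alpha)/2 * norm(beta,2)^2 + alpha * norm(beta,1);
Psum  = sum((1-alpha)/2 * beta.^2 + alpha * abs(beta));

% Both expressions give 5.75 here. alpha = 1 recovers the lasso (L1)
% penalty; alpha -> 0 approaches the ridge penalty (1/2)*||beta||_2^2.
Pnorm
Psum
```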

