Lasso is a regularization technique. Use `lassoglm`

to:

Reduce the number of predictors in a generalized linear model.

Identify important predictors.

Select among redundant predictors.

Produce shrinkage estimates with potentially lower predictive errors than ordinary least squares.

Elastic net is a related technique. Use it when you
have several highly correlated variables. `lassoglm`

provides
elastic net regularization when you set the `Alpha`

name-value
pair to a number strictly between `0`

and `1`

.

For details about lasso and elastic net computations and algorithms, see Generalized Linear Model Lasso and Elastic Net. For a discussion of generalized linear models, see What Are Generalized Linear Models?.

This example shows how to identify and remove redundant predictors from a generalized linear model.

Create data with 20 predictors, and Poisson responses using just three of the predictors, plus a constant.

rng('default') % for reproducibility X = randn(100,20); mu = exp(X(:,[5 10 15])*[.4;.2;.3] + 1); y = poissrnd(mu);

Construct a cross-validated lasso regularization of a Poisson regression model of the data.

[B FitInfo] = lassoglm(X,y,'poisson','CV',10);

Examine the cross-validation plot to see the effect of the `Lambda`

regularization parameter.

lassoPlot(B,FitInfo,'plottype','CV');

The green circle and dashed line locate the `Lambda`

with minimal cross-validation error. The blue circle and dashed line locate the point with minimal cross-validation error plus one standard deviation.

Find the nonzero model coefficients corresponding to the two identified points.

minpts = find(B(:,FitInfo.IndexMinDeviance))

minpts = 3 5 6 10 11 15 16

min1pts = find(B(:,FitInfo.Index1SE))

min1pts = 5 10 15

The coefficients from the minimal plus one standard error point are exactly those coefficients used to create the data.

Find the values of the model coefficients at the minimal plus one standard error point.

B(min1pts,FitInfo.Index1SE)

ans = 0.2903 0.0789 0.2081

The values of the coefficients are, as expected, smaller than the original `[0.4,0.2,0.3]`

. Lasso works by "shrinkage," which biases predictor coefficients toward zero.

The constant term is in the `FitInfo.Intercept`

vector.

FitInfo.Intercept(FitInfo.Index1SE)

ans = 1.0879

The constant term is near 1, which is the value used to generate the data.

This example shows how to regularize binomial regression. The default (canonical) link function for binomial regression is the logistic function.

**Step 1. Prepare the data.**

Load the `ionosphere`

data. The response `Y`

is a cell array of `'g'`

or `'b'`

strings. Convert the cells to logical values, with `true`

representing `'g'`

. Remove the first two columns of `X`

because they have some awkward statistical properties, which are beyond the scope of this discussion.

load ionosphere Ybool = strcmp(Y,'g'); X = X(:,3:end);

**Step 2. Create a cross-validated fit.**

Construct a regularized binomial regression using 25 `Lambda`

values and 10-fold cross validation. This process can take a few minutes.

rng('default') % for reproducibility [B,FitInfo] = lassoglm(X,Ybool,'binomial',... 'NumLambda',25,'CV',10);

**Step 3. Examine plots to find appropriate regularization.**

`lassoPlot`

can give both a standard trace plot and a cross-validated deviance plot. Examine both plots.

lassoPlot(B,FitInfo,'PlotType','CV');

The plot identifies the minimum-deviance point with a green circle and dashed line as a function of the regularization parameter `Lambda`

. The blue circled point has minimum deviance plus no more than one standard deviation.

lassoPlot(B,FitInfo,'PlotType','Lambda','XScale','log');

The trace plot shows nonzero model coefficients as a function of the regularization parameter `Lambda`

. Because there are 32 predictors and a linear model, there are 32 curves. As `Lambda`

increases to the left, `lassoglm`

sets various coefficients to zero, removing them from the model.

The trace plot is somewhat compressed. Zoom in to see more detail.

xlim([.01 .1]) ylim([-3 3])

As `Lambda`

increases toward the left side of the plot, fewer nonzero coefficients remain.

Find the number of nonzero model coefficients at the `Lambda`

value with minimum deviance plus one standard deviation point. The regularized model coefficients are in column `FitInfo.Index1SE`

of the `B`

matrix.

indx = FitInfo.Index1SE; B0 = B(:,indx); nonzeros = sum(B0 ~= 0)

nonzeros = 14

When you set `Lambda`

to `FitInfo.Index1SE`

, `lassoglm`

removes over half of the 32 original predictors.

**Step 4. Create a regularized model.**

The constant term is in the `FitInfo.Index1SE`

entry of the `FitInfo.Intercept`

vector. Call that value `cnst`

.

The model is logit(mu) = log(mu/(1 - mu)) = `X*B0 + cnst`

. Therefore, for predictions, mu = `exp(X*B0 + cnst)/(1+exp(x*B0 + cnst))`

.

The `glmval`

function evaluates model predictions. It assumes that the first model coefficient relates to the constant term. Therefore, create a coefficient vector with the constant term first.

cnst = FitInfo.Intercept(indx); B1 = [cnst;B0];

**Step 5. Examine residuals.**

Plot the training data against the model predictions for the regularized `lassoglm`

model.

preds = glmval(B1,X,'logit'); histogram(Ybool - preds) % plot residuals title('Residuals from lassoglm model')

**Step 6. Alternative: Use identified predictors in a least-squares generalized linear model.**

Instead of using the biased predictions from the model, you can make an unbiased model using just the identified predictors.

predictors = find(B0); % indices of nonzero predictors mdl = fitglm(X,Ybool,'linear',... 'Distribution','binomial','PredictorVars',predictors)

mdl = Generalized Linear regression model: y ~ [Linear formula with 15 terms in 14 predictors] Distribution = Binomial Estimated Coefficients: Estimate SE tStat pValue _________ _______ ________ __________ (Intercept) -2.9367 0.50926 -5.7666 8.0893e-09 x1 2.492 0.60795 4.099 4.1502e-05 x3 2.5501 0.63304 4.0284 5.616e-05 x4 0.48816 0.50336 0.9698 0.33215 x5 0.6158 0.62192 0.99015 0.3221 x6 2.294 0.5421 4.2317 2.3198e-05 x7 0.77842 0.57765 1.3476 0.1778 x12 1.7808 0.54316 3.2786 0.0010432 x16 -0.070993 0.50515 -0.14054 0.88823 x20 -2.7767 0.55131 -5.0365 4.7402e-07 x24 2.0212 0.57639 3.5067 0.00045372 x25 -2.3796 0.58274 -4.0835 4.4363e-05 x27 0.79564 0.55904 1.4232 0.15467 x29 1.2689 0.55468 2.2876 0.022162 x32 -1.5681 0.54336 -2.8859 0.0039035 351 observations, 336 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 262, p-value = 1e-47

Plot the residuals of the model.

plotResiduals(mdl)

As expected, residuals from the least-squares model are slightly smaller than those of the regularized model. However, this does not mean that `mdl`

is a better predictor for new data.

This example shows how to regularize
a model with many more predictors than observations. *Wide
data* is data with more predictors than observations. Typically,
with wide data you want to identify important predictors. Use `lassoglm`

as
an exploratory or screening tool to select a smaller set of variables
to prioritize your modeling and research. Use parallel computing to
speed up cross validation.

Load the `ovariancancer`

data.
This data has 216 observations and 4000 predictors in the `obs`

workspace
variable. The responses are binary, either `'Cancer'`

or `'Normal'`

,
in the `grp`

workspace variable. Convert
the responses to binary for use in `lassoglm`

.

load ovariancancer y = strcmp(grp,'Cancer');

Set options to use parallel computing.
Prepare to compute in parallel using `parpool`

.

```
opt = statset('UseParallel',true);
parpool()
```

Starting parpool using the 'local' profile ... connected to 2 workers. ans = Pool with properties: AttachedFiles: {0x1 cell} NumWorkers: 2 IdleTimeout: 30 Cluster: [1x1 parallel.cluster.Local] RequestQueue: [1x1 parallel.RequestQueue] SpmdEnabled: 1

Fit a cross-validated set of regularized
models. Use the `Alpha`

parameter to favor
retaining groups of highly correlated predictors, as opposed to eliminating
all but one member of the group. Commonly, you use a relatively large
value of `Alpha`

.

rng('default') % for reproducibility tic [B,S] = lassoglm(obs,y,'binomial','NumLambda',100, ... 'Alpha',0.9,'LambdaRatio',1e-4,'CV',10,'Options',opt); toc

Elapsed time is 398.635386 seconds.

Examine cross-validation plot.

lassoPlot(B,S,'PlotType','CV');

Examine trace plot.

lassoPlot(B,S,'PlotType','Lambda','XScale','log')

The right (green) vertical dashed
line represents the `Lambda`

providing
the smallest cross-validated deviance. The left (blue) dashed line
has the minimal deviance plus no more than one standard deviation.
This blue line has many fewer predictors:

[S.DF(S.Index1SE) S.DF(S.IndexMinDeviance)]

ans = 50 86

You asked `lassoglm`

to
fit using 100 different `Lambda`

values.
How many did it use?

size(B)

ans = 4000 84

`lassoglm`

stopped after
84 values because the deviance was too small for small `Lambda`

values.
To avoid overfitting, `lassoglm`

halts
when the deviance of the fitted model is too small compared to the
deviance in the binary responses, ignoring the predictor variables.

You can force `lassoglm`

to
include more terms by explicitly providing a set of `Lambda`

values.

minLambda = min(S.Lambda); explicitLambda = [minLambda*[.1 .01 .001] S.Lambda]; [B2,S2] = lassoglm(obs,y,'binomial','Lambda',explicitLambda,... 'LambdaRatio',1e-4, 'CV',10,'Options',opt); length(S2.Lambda)

ans = 87

`lassoglm`

used the three
smaller values in fitting.

To save time, you can use:

Fewer

`Lambda`

, meaning fewer fitsFewer cross-validation folds

A larger value for

`LambdaRatio`

Use serial computation and all three of these time-saving methods:

tic [Bquick,Squick] = lassoglm(obs,y,'binomial','NumLambda',25,... 'LambdaRatio',1e-2,'CV',5); toc

Elapsed time is 51.708074 seconds.

Graphically compare the new results to the first results.

lassoPlot(Bquick,Squick,'PlotType','CV');

lassoPlot(Bquick,Squick,'PlotType','Lambda','XScale','log')

The number of nonzero coefficients in the lowest plus one standard deviation model is around 50, similar to the first computation.

*Lasso* is a regularization
technique for estimating generalized linear models. Lasso includes
a penalty term that constrains the size of the estimated coefficients.
Therefore, it resembles ridge regression. Lasso
is a *shrinkage estimator*: it generates
coefficient estimates that are biased to be small. Nevertheless, a
lasso estimator can have smaller error than an ordinary maximum likelihood
estimator when you apply it to new data.

Unlike ridge regression, as the penalty term increases, the lasso technique sets more coefficients to zero. This means that the lasso estimator is a smaller model, with fewer predictors. As such, lasso is an alternative to stepwise regression and other model selection and dimensionality reduction techniques.

*Elastic net* is
a related technique. Elastic net is akin to a hybrid of ridge regression
and lasso regularization. Like lasso, elastic net can generate reduced
models by generating zero-valued coefficients. Empirical studies suggest
that the elastic net technique can outperform lasso on data with highly
correlated predictors.

For a nonnegative value of *λ*, `lasso`

solves
the problem

$$\underset{{\beta}_{0},\beta}{\mathrm{min}}\left(\frac{1}{N}\text{Deviance}\left({\beta}_{0},\beta \right)+\lambda {\displaystyle \sum _{j=1}^{p}\left|{\beta}_{j}\right|}\right),$$

where

Deviance is the deviance of the model fit to the responses using intercept

*β*_{0}and predictor coefficients*β*. The formula for Deviance depends on the`distr`

parameter you supply to`lassoglm`

. Minimizing the*λ*-penalized deviance is equivalent to maximizing the*λ*-penalized log likelihood.*N*is the number of observations.*λ*is a nonnegative regularization parameter corresponding to one value of`Lambda`

.Parameters

*β*_{0}and*β*are scalar and*p*-vector respectively.

As *λ* increases, the number of nonzero
components of *β* decreases.

The lasso problem involves the *L*^{1} norm
of *β*, as contrasted with the elastic net
algorithm.

For an *α* strictly between 0 and 1,
and a nonnegative *λ*, elastic net solves the
problem

$$\underset{{\beta}_{0},\beta}{\mathrm{min}}\left(\frac{1}{N}\text{Deviance}\left({\beta}_{0},\beta \right)+\lambda {P}_{\alpha}\left(\beta \right)\right),$$

where

$${P}_{\alpha}\left(\beta \right)=\frac{(1-\alpha )}{2}{\Vert \beta \Vert}_{2}^{2}+\alpha {\Vert \beta \Vert}_{1}={\displaystyle \sum _{j=1}^{p}\left(\frac{(1-\alpha )}{2}{\beta}_{j}^{2}+\alpha \left|{\beta}_{j}\right|\right)}.$$

Elastic net is the same as lasso when *α* = 1. For other values of *α*,
the penalty term *P _{α}*(

`ridge`

regression.[1] Tibshirani, R. *Regression
Shrinkage and Selection via the Lasso.* Journal of the
Royal Statistical Society, Series B, Vol. 58, No. 1, pp. 267–288,
1996.

[2] Zou, H. and T. Hastie. *Regularization
and Variable Selection via the Elastic Net.* Journal of
the Royal Statistical Society, Series B, Vol. 67, No. 2, pp. 301–320,
2005.

[3] Friedman, J., R. Tibshirani,
and T. Hastie. *Regularization Paths for Generalized
Linear Models via Coordinate Descent.* Journal of Statistical
Software, Vol. 33, No. 1, 2010. `http://www.jstatsoft.org/v33/i01`

[4] Hastie, T., R. Tibshirani,
and J. Friedman. *The Elements of Statistical
Learning,* 2nd edition. Springer, New York, 2008.

[5] McCullagh, P., and J. A. Nelder. *Generalized
Linear Models,* 2nd edition. Chapman & Hall/CRC Press,
1989.

Was this topic helpful?