# lasso

Regularized least-squares regression using lasso or elastic net algorithms

## Syntax

```
B = lasso(X,Y)
B = lasso(X,Y,Name,Value)
[B,FitInfo] = lasso(___)
```

## Description

`B = lasso(X,Y)` returns fitted least-squares regression coefficients for a set of regularization coefficients `Lambda`.

`B = lasso(X,Y,Name,Value)` fits regularized regressions with additional options specified by one or more `Name,Value` pair arguments.

`[B,FitInfo] = lasso(___)`, for any previous input syntax, also returns a structure containing information about the fits.

## Input Arguments

`X`

Numeric matrix. Each row represents one observation, and each column represents one predictor (variable).

`Y`

Numeric vector of length `n`, where `n` is the number of rows of `X`. `Y(i)` is the response to row `i` of `X`.

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside single quotes (`' '`). You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

`'AbsTol'`

Absolute error tolerance used to determine convergence of the ADMM algorithm (see Algorithms). The algorithm converges when successive estimates of the coefficient vector differ by an amount less than `AbsTol`.

### Note

This option only applies when using `lasso` on tall arrays. See Extended Capabilities for more information.

Default: `1e-4`

`'Alpha'`

Scalar value in the interval `(0,1]` representing the weight of lasso (L1) versus ridge (L2) optimization. `Alpha = 1` represents lasso regression, `Alpha` close to `0` approaches ridge regression, and other values represent elastic net optimization. See Definitions.

Default: `1`
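As a brief illustration of the `'Alpha'` option, the following sketch fits an elastic net that weights the L1 and L2 penalty terms equally. The data and variable names are made up for this illustration.

```matlab
rng default                               % For reproducibility
X = randn(100,5);                         % Hypothetical predictor data
Y = X*[0;2;0;-3;0] + 0.1*randn(100,1);    % Hypothetical response

% Alpha = 0.5 gives equal weight to the L1 and L2 penalty terms
[B,FitInfo] = lasso(X,Y,'Alpha',0.5);
```

`FitInfo.Alpha` records the value that was used for the fit.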

`'B0'`

Initial values for the coefficients x in the ADMM algorithm (see Algorithms).

### Note

This option only applies when using `lasso` on tall arrays. See Extended Capabilities for more information.

Default: Vector of zeros

`'CV'`

Method `lasso` uses to estimate mean squared error:

• `K`, a positive integer — `lasso` uses `K`-fold cross-validation.

• `cvp`, a `cvpartition` object — `lasso` uses the cross-validation method expressed in `cvp`. You cannot use a `'leaveout'` partition with `lasso`.

• `'resubstitution'` — `lasso` uses `X` and `Y` to fit the model and to estimate the mean squared error without cross-validation.

Default: `'resubstitution'`

`'DFmax'`

Maximum number of nonzero coefficients in the model. `lasso` returns results only for `Lambda` values that satisfy this criterion.

Default: `Inf`
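A minimal sketch of `'DFmax'`, using made-up data: the call below asks for fits with at most two nonzero coefficients, so the returned `FitInfo.DF` values should not exceed 2.

```matlab
rng default                               % For reproducibility
X = randn(100,5);                         % Hypothetical predictor data
Y = X*[0;2;0;-3;0] + 0.1*randn(100,1);    % Hypothetical response

% Keep only the fits that have at most two nonzero coefficients
[B,FitInfo] = lasso(X,Y,'DFmax',2);
maxDF = max(FitInfo.DF);                  % Largest model size that was returned
```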

`'Lambda'`

Vector of nonnegative `Lambda` values. See Definitions.

• If you do not supply `Lambda`, `lasso` calculates the largest value of `Lambda` that gives a nonnull model. In this case, `LambdaRatio` gives the ratio of the smallest to the largest value of the sequence, and `NumLambda` gives the length of the vector.

• If you supply `Lambda`, `lasso` ignores `LambdaRatio` and `NumLambda`.

Default: Geometric sequence of `NumLambda` values, the largest just sufficient to produce `B = 0`

`'LambdaRatio'`

Positive scalar, the ratio of the smallest to the largest `Lambda` value when you do not set `Lambda`.

If you set `LambdaRatio = 0`, `lasso` generates a default sequence of `Lambda` values, and replaces the smallest one with `0`.

Default: `1e-4`

`'MaxIter'`

Maximum number of iterations allowed, specified as a positive integer. If the algorithm executes `MaxIter` iterations before reaching the convergence tolerance, then the function stops iterating and returns a warning message. The function can return more than one warning when `NumLambda` is greater than `1`.

Default: `1e5` (standard), `1e4` (for tall arrays)

`'MCReps'`

Positive integer, the number of Monte Carlo repetitions for cross-validation.

• If `CV` is `'resubstitution'` or a `cvpartition` of type `'resubstitution'`, `MCReps` must be `1`.

• If `CV` is a `cvpartition` of type `'holdout'`, `MCReps` must be greater than `1`.

Default: `1`
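The interaction between `'CV'` and `'MCReps'` can be sketched as follows, assuming a holdout partition created with `cvpartition`. The data, the 30% holdout fraction, and the five repetitions are arbitrary choices for this illustration.

```matlab
rng default                               % For reproducibility
X = randn(100,5);                         % Hypothetical predictor data
Y = X*[0;2;0;-3;0] + 0.1*randn(100,1);    % Hypothetical response

% Hold out 30% of the observations for validation.
% A 'holdout' partition requires MCReps greater than 1.
cvp = cvpartition(size(X,1),'HoldOut',0.3);
[B,FitInfo] = lasso(X,Y,'CV',cvp,'MCReps',5);

bestLambda = FitInfo.LambdaMinMSE;        % Lambda with minimum validated MSE
```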

`'NumLambda'`

Positive integer, the number of `Lambda` values `lasso` uses when you do not set `Lambda`. `lasso` can return fewer than `NumLambda` fits if the residual error of the fits drops below a threshold fraction of the variance of `Y`.

Default: `100`
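To make the roles of `Lambda`, `LambdaRatio`, and `NumLambda` concrete, here is a sketch of how a default-style geometric sequence could be built for `Alpha = 1`. It is derived from the lasso objective given in the definitions (with all coefficients at zero, the optimality condition is `max(abs(X'*(Y - mean(Y))))/n <= lambda`), not taken from the `lasso` source code. Standardizing `X` with 1/n normalization is an assumption of this sketch, so the values need not match `FitInfo.Lambda` exactly.

```matlab
rng default                               % For reproducibility
X = randn(100,5);                         % Hypothetical predictor data
Y = X*[0;2;0;-3;0] + 0.1*randn(100,1);    % Hypothetical response
n = size(X,1);

% Center the response; center and scale the predictors
% (assumption: scaling uses the 1/n standard deviation)
Xs = (X - mean(X)) ./ std(X,1);
Yc = Y - mean(Y);

% Smallest lambda for which every coefficient is zero when Alpha = 1
lambdaMax = max(abs(Xs'*Yc))/n;

% Geometric sequence from lambdaMax*lambdaRatio up to lambdaMax
lambdaRatio = 1e-4;
numLambda = 100;
lambda = lambdaMax*logspace(log10(lambdaRatio),0,numLambda);
```

Consecutive values differ by a constant factor, which is why the sequence is fully described by its largest value, its ratio, and its length.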

`'Options'`

Structure that specifies whether to cross-validate in parallel, and specifies the random streams. Create the `Options` structure with `statset`. Option fields:

• `UseParallel` — Set to `true` to compute in parallel. Default is `false`.

• `UseSubstreams` — Set to `true` to compute in parallel in a reproducible fashion. To compute reproducibly, set `Streams` to a type allowing substreams: `'mlfg6331_64'` or `'mrg32k3a'`. Default is `false`.

• `Streams` — A `RandStream` object or cell array consisting of one such object. If you do not specify `Streams`, `lasso` uses the default stream.
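The option fields above can be combined as in the following sketch, which requests reproducible parallel cross-validation. It assumes Parallel Computing Toolbox is available for the parallel part. The data are made up for this illustration.

```matlab
rng default                               % For reproducibility
X = randn(100,5);                         % Hypothetical predictor data
Y = X*[0;2;0;-3;0] + 0.1*randn(100,1);    % Hypothetical response

% Use a stream type that supports substreams so that results are reproducible
s = RandStream('mlfg6331_64');
opts = statset('UseParallel',true,'UseSubstreams',true,'Streams',s);

[B,FitInfo] = lasso(X,Y,'CV',10,'Options',opts);
```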

`'PredictorNames'`

Cell array of character vectors representing names of the predictor variables, in the order in which they appear in `X`. For an example, see Remove Redundant Predictors by Using Cross-Validated Fits.

Default: `{}`

`'RelTol'`

Convergence threshold for the coordinate descent algorithm [3]. The algorithm terminates when successive estimates of the coefficient vector differ in the L2 norm by a relative amount less than `RelTol`.

Default: `1e-4`

`'Rho'`

Augmented Lagrangian parameter ρ for the ADMM algorithm (see Algorithms).

### Note

This option only applies when using `lasso` on tall arrays. See Extended Capabilities for more information.

Default: Automatic selection

`'Standardize'`

Boolean value specifying whether `lasso` scales `X` before fitting the models. This affects whether the regularization is applied to the coefficients on the standardized scale or original scale. The results are always presented on the original data scale.

`X` and `Y` are always centered.

Default: `true`

`'U0'`

Initial value of the scaled dual variable u in the ADMM algorithm (see Algorithms).

### Note

This option only applies when using `lasso` on tall arrays. See Extended Capabilities for more information.

Default: Vector of zeros

`'Weights'`

Observation weights, a nonnegative vector of length `n`, where `n` is the number of rows of `X`. `lasso` scales `Weights` to sum to `1`.

Default: `1/n * ones(n,1)`

## Output Arguments

`B`

Fitted coefficients, a `p`-by-`L` matrix, where `p` is the number of predictors (columns) in `X`, and `L` is the number of `Lambda` values.

`FitInfo`

Structure containing information about the model fits.

| Field in `FitInfo` | Description |
| --- | --- |
| `Intercept` | Intercept term $\beta_0$ for each linear model, a `1`-by-`L` vector |
| `Lambda` | Lambda parameters in ascending order, a `1`-by-`L` vector |
| `Alpha` | Value of `Alpha` parameter, a scalar |
| `DF` | Number of nonzero coefficients in `B` for each value of `Lambda`, a `1`-by-`L` vector |
| `MSE` | Mean squared error (MSE), a `1`-by-`L` vector |

If you set the `CV` name-value pair to cross-validate, the `FitInfo` structure contains additional fields.

| Field in `FitInfo` | Description |
| --- | --- |
| `SE` | The standard error of MSE for each `Lambda`, as calculated during cross-validation, a `1`-by-`L` vector |
| `LambdaMinMSE` | The `Lambda` value with minimum MSE, a scalar |
| `Lambda1SE` | The largest `Lambda` such that MSE is within one standard error of the minimum MSE, a scalar |
| `IndexMinMSE` | The index of `Lambda` with value `LambdaMinMSE`, a scalar |
| `Index1SE` | The index of `Lambda` with value `Lambda1SE`, a scalar |
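The outputs can be combined to predict responses for new data. The sketch below picks the column of `B` and the element of `FitInfo.Intercept` that correspond to `Index1SE`. The data and the names `coef`, `coef0`, `Xnew`, and `Yhat` are made up for this illustration.

```matlab
rng default                               % For reproducibility
X = randn(100,5);                         % Hypothetical predictor data
Y = X*[0;2;0;-3;0] + 0.1*randn(100,1);    % Hypothetical response

[B,FitInfo] = lasso(X,Y,'CV',10);

% Select the sparsest model within one standard error of the minimum MSE
idx   = FitInfo.Index1SE;
coef  = B(:,idx);                         % p-by-1 coefficient vector
coef0 = FitInfo.Intercept(idx);           % Scalar intercept for that model

% Predict the response for new observations
Xnew = randn(10,5);
Yhat = Xnew*coef + coef0;
```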

## Examples


Construct a data set with redundant predictors and identify those predictors by using `lasso`.

Create a matrix `X` of 100 five-dimensional normal variables. Create a response vector `Y` from just two components of `X` and add a small amount of noise.

```
rng default % For reproducibility
X = randn(100,5);
r = [0;2;0;-3;0]; % Only two nonzero coefficients
Y = X*r + randn(100,1)*.1; % Small added noise
```

Construct the default lasso fit.

```
B = lasso(X,Y);
```

Find the coefficient vector for the 25th `Lambda` value, which is the 25th column of `B`.

```
B(:,25)
```
```
ans =

         0
    1.6093
         0
   -2.5865
         0
```

`lasso` identifies and removes the redundant predictors.

Construct a data set with redundant predictors and identify those predictors by using cross-validated `lasso`.

Create a matrix `X` of 100 five-dimensional normal variables. Create a response vector `Y` from two components of `X` and add a small amount of noise.

```
rng default % For reproducibility
X = randn(100,5);
r = [0;2;0;-3;0]; % Only two nonzero coefficients
Y = X*r + randn(100,1)*.1; % Small added noise
```

Construct the lasso fit by using tenfold cross-validation with labeled predictor variables.

```
[B,FitInfo] = lasso(X,Y,'CV',10,'PredictorNames',{'x1','x2','x3','x4','x5'});
```

Display the variables in the model that corresponds to the minimum cross-validated mean squared error (MSE).

```
minMSEModel = FitInfo.PredictorNames(B(:,FitInfo.IndexMinMSE)~=0)
```
```
minMSEModel = 1x2 cell array
    {'x2'}    {'x4'}
```

Display the variables in the sparsest model within one standard error of the minimum MSE.

```
sparseModel = FitInfo.PredictorNames(B(:,FitInfo.Index1SE)~=0)
```
```
sparseModel = 1x2 cell array
    {'x2'}    {'x4'}
```

In this example, `lasso` identifies the same predictors for the two models and removes the redundant predictors. However, in general, `lasso` can choose a different set of predictors.

Visually examine the cross-validated error of various levels of regularization.

```
load acetylene
```

Prepare the design matrix for a lasso fit with interactions.

```
X = [x1 x2 x3];
D = x2fx(X,'interaction');
D(:,1) = []; % No constant term
```

Construct the lasso fit using ten-fold cross-validation. Include the `FitInfo` output so you can plot the result.

```
rng default % For reproducibility
[B,FitInfo] = lasso(D,y,'CV',10);
```

Plot the cross-validated fits.

```
lassoPlot(B,FitInfo,'PlotType','CV');
```

The green circle and dashed line locate the `Lambda` with minimum cross-validation error. The blue circle and dashed line locate the point with minimum cross-validation error plus one standard deviation.

## Definitions

### Lasso

For a given value of λ, a nonnegative parameter, `lasso` solves the problem

`$\underset{{\beta }_{0},\beta }{\mathrm{min}}\left(\frac{1}{2N}\sum _{i=1}^{N}{\left({y}_{i}-{\beta }_{0}-{x}_{i}^{T}\beta \right)}^{2}+\lambda \sum _{j=1}^{p}|{\beta }_{j}|\right).$`

• $N$ is the number of observations.

• $y_i$ is the response at observation $i$.

• $x_i$ is data, a vector of $p$ values at observation $i$.

• $\lambda$ is a nonnegative regularization parameter corresponding to one value of `Lambda`.

• The parameters $\beta_0$ and $\beta$ are a scalar and a vector of length $p$, respectively.

As λ increases, the number of nonzero components of β decreases.

The lasso problem involves the L1 norm of β, as contrasted with the elastic net algorithm.

### Elastic Net

For an α strictly between 0 and 1, and a nonnegative λ, elastic net solves the problem

`$\underset{{\beta }_{0},\beta }{\mathrm{min}}\left(\frac{1}{2N}\sum _{i=1}^{N}{\left({y}_{i}-{\beta }_{0}-{x}_{i}^{T}\beta \right)}^{2}+\lambda {P}_{\alpha }\left(\beta \right)\right),$`

where

`${P}_{\alpha }\left(\beta \right)=\frac{\left(1-\alpha \right)}{2}{‖\beta ‖}_{2}^{2}+\alpha {‖\beta ‖}_{1}=\sum _{j=1}^{p}\left(\frac{\left(1-\alpha \right)}{2}{\beta }_{j}^{2}+\alpha |{\beta }_{j}|\right).$`

Elastic net is the same as lasso when α = 1. As α shrinks toward 0, elastic net approaches `ridge` regression. For other values of α, the penalty term $P_\alpha(\beta)$ interpolates between the L1 norm of β and the squared L2 norm of β.
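As a small worked illustration, the penalty term and the full objective can be written as anonymous functions. The names `P` and `objective` and the sample vector `beta` are chosen for this sketch only.

```matlab
% Elastic net penalty from the definition above
P = @(beta,alpha) (1-alpha)/2*sum(beta.^2) + alpha*sum(abs(beta));

% Full objective for intercept beta0, coefficients beta, and data (X,y)
objective = @(beta0,beta,X,y,lambda,alpha) ...
    sum((y - beta0 - X*beta).^2)/(2*size(X,1)) + lambda*P(beta,alpha);

beta = [1; -2; 0];
pLasso = P(beta,1);      % L1 norm: 1 + 2 + 0 = 3
pMixed = P(beta,0.5);    % 0.25*5 + 0.5*3 = 2.75
pRidge = P(beta,0);      % Half the squared L2 norm: 5/2 = 2.5
```

Note that `lasso` itself accepts only `Alpha` in `(0,1]`. The value `alpha = 0` is used here just to show the ridge limit of the penalty.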

## Algorithms


When operating on tall arrays, `lasso` uses an algorithm based on the Alternating Direction Method of Multipliers (ADMM) [5]. The notation used here is the same as in the reference paper. This method solves problems of the form

Minimize $l\left(x\right)+g\left(z\right)$

Subject to $Ax+Bz=c$

Using this notation the lasso regression problem is

Minimize $l\left(x\right)+g\left(z\right)=\frac{1}{2}{‖Ax-b‖}_{2}^{2}+\lambda {‖z‖}_{1}$

Subject to $x-z=0$

Since the loss function $l\left(x\right)=\frac{1}{2}{‖Ax-b‖}_{2}^{2}$ is quadratic, the iterative updates performed by the algorithm amount to solving a linear system of equations with a single coefficient matrix but several right-hand sides. The updates performed by the algorithm during each iteration are

`$\begin{array}{l}{x}^{k+1}={\left({A}^{T}A+\rho I\right)}^{-1}\left({A}^{T}b+\rho \left({z}^{k}-{u}^{k}\right)\right)\\ {z}^{k+1}={S}_{\lambda /\rho }\left({x}^{k+1}+{u}^{k}\right)\\ {u}^{k+1}={u}^{k}+{x}^{k+1}-{z}^{k+1}.\end{array}$`

A is the dataset (a tall array), x contains the coefficients, ρ is the penalty parameter (augmented Lagrangian parameter), b is the response (a tall array), and S is the soft thresholding operator.

`${S}_{\kappa }\left(a\right)=\begin{cases}a-\kappa , & a>\kappa \\ 0, & |a|\le \kappa \\ a+\kappa , & a<-\kappa \end{cases}.$`

`lasso` solves the linear system using Cholesky factorization because the coefficient matrix ${A}^{T}A+\rho I$ is symmetric and positive definite. Because $\rho$ does not change between iterations, the factorization is computed once and reused in every iteration.

Even though A and b are tall arrays, they appear only in the terms ${A}^{T}A$ and ${A}^{T}b$. The results of these two matrix multiplications are small enough to fit in memory, so they are precomputed and the iterative updates are performed entirely in memory.
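The updates described in this section can be sketched in plain MATLAB on ordinary in-memory arrays. This is an illustration of the iteration, not the implementation inside `lasso`: the data, the fixed values of `lambda` and `rho`, the stopping test, and all variable names are assumptions of the sketch.

```matlab
rng default                               % For reproducibility
A = randn(50,4);                          % Hypothetical predictor data
b = A*[0;2;0;-3] + 0.1*randn(50,1);       % Hypothetical response
lambda = 5;                               % Weight on the L1 term
rho = 50;                                 % Penalty parameter, fixed in this sketch
p = size(A,2);

% Soft thresholding operator S_kappa(a)
softThreshold = @(a,kappa) sign(a).*max(abs(a) - kappa,0);

% A and b enter the updates only through these small products
AtA = A'*A;
Atb = A'*b;

% Factor once: R is upper triangular with R'*R = AtA + rho*I
R = chol(AtA + rho*eye(p));

x = zeros(p,1);                           % Coefficients
z = zeros(p,1);                           % Copy of x that carries the L1 penalty
u = zeros(p,1);                           % Scaled dual variable
for k = 1:1000
    zOld = z;
    x = R\(R'\(Atb + rho*(z - u)));       % x-update, reusing the cached factor
    z = softThreshold(x + u,lambda/rho);  % z-update
    u = u + x - z;                        % u-update
    if norm(z - zOld) < 1e-10 && norm(x - z) < 1e-10
        break                             % Simple stopping test (assumption)
    end
end
```

At convergence `z` holds the sparse coefficient estimate and `x` agrees with it to within the tolerance. The two triangular solves with `R` replace a fresh solve of the linear system in each iteration.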

## References

[1] Tibshirani, R. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B, Vol. 58, No. 1, 1996, pp. 267–288.

[2] Zou, H. and T. Hastie. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society. Series B, Vol. 67, No. 2, 2005, pp. 301–320.

[3] Friedman, J., R. Tibshirani, and T. Hastie. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software. Vol. 33, No. 1, 2010. `http://www.jstatsoft.org/v33/i01`

[4] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. 2nd edition. New York: Springer, 2008.

[5] Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends in Machine Learning. Vol. 3, No. 1, 2010, pp. 1–122.