To determine a good lasso-penalty strength for a linear regression model that uses least squares, implement 5-fold cross-validation.

Simulate 10000 observations from this model:

$$y={x}_{100}+2{x}_{200}+e.$$

$$X=\{x_{1},\ldots,x_{1000}\}$$ is a 10000-by-1000 sparse matrix in which 10% of the elements are nonzero and drawn from the standard normal distribution.

*e* is random normal error with mean 0 and standard deviation 0.3.
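One way to simulate these data is sketched below. The random seed and the variable names `n`, `d`, `nz`, and `Y` are illustrative assumptions. `sprandn` treats the density as approximate, so the realized fraction of nonzero elements can be slightly below 10%.

```matlab
rng(1)              % Assumed seed, used only for reproducibility
n = 1e4;            % Number of observations
d = 1e3;            % Number of predictors
nz = 0.1;           % Target fraction of nonzero elements in X
X = sprandn(n,d,nz);                        % Sparse matrix, standard normal nonzeros
Y = X(:,100) + 2*X(:,200) + 0.3*randn(n,1); % Response with noise of standard deviation 0.3
```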

Create a set of 15 logarithmically spaced regularization strengths from $$10^{-5}$$ through $$10^{-1}$$.
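A minimal sketch of this grid using `logspace`, whose first two arguments are base-10 exponents. The variable name `Lambda` is an assumption chosen to match the model property of the same name.

```matlab
Lambda = logspace(-5,-1,15);   % 15 values from 1e-5 through 1e-1, evenly spaced in log10
```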

Cross-validate the models. To increase execution speed, transpose the predictor data and specify that the observations are in columns. Optimize the objective function using SpaRSA.
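The call might look like the following sketch. It assumes that `X` (10000-by-1000), `Y`, and `Lambda` already exist in the workspace as described in the preceding steps. `numCVModels` is a hypothetical variable name.

```matlab
% Assumes X (observations in rows), Y, and Lambda are in the workspace.
X = X';   % Transpose so that each column is one observation (1000-by-10000)
CVMdl = fitrlinear(X,Y,'ObservationsIn','columns','KFold',5, ...
    'Lambda',Lambda,'Learner','leastsquares', ...
    'Solver','sparsa','Regularization','lasso');
numCVModels = numel(CVMdl.Trained)   % Number of trained models, one per fold
```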

`CVMdl` is a `RegressionPartitionedLinear` model. Because `fitrlinear` implements 5-fold cross-validation, `CVMdl` contains 5 `RegressionLinear` models that the software trains on each fold.

Display the first trained linear regression model.
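The trained models are stored in the `Trained` property, a cell array. Assuming `CVMdl` is the cross-validated model in the workspace, indexing the first cell without a trailing semicolon displays that model:

```matlab
% Assumes CVMdl is the cross-validated model in the workspace.
Mdl1 = CVMdl.Trained{1}   % No semicolon, so MATLAB displays the model
```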

Mdl1 =
RegressionLinear
ResponseName: 'Y'
ResponseTransform: 'none'
Beta: [1000x15 double]
Bias: [1x15 double]
Lambda: [1x15 double]
Learner: 'leastsquares'
Properties, Methods

`Mdl1` is a `RegressionLinear` model object. `fitrlinear` constructed `Mdl1` by training on the first four folds. Because `Lambda` is a sequence of regularization strengths, you can think of `Mdl1` as 15 models, one for each regularization strength in `Lambda`.

Estimate the cross-validated MSE.
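A sketch of this step using `kfoldLoss`, which by default returns the mean squared error for regression models. It assumes `CVMdl` is in the workspace. `mse` is a hypothetical variable name.

```matlab
% Assumes CVMdl is the cross-validated model in the workspace.
mse = kfoldLoss(CVMdl);   % 1-by-15 vector, one cross-validated MSE per value of Lambda
```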

Higher values of `Lambda` shrink more coefficients to exactly zero (predictor variable sparsity), which is a desirable quality of a regression model. For each regularization strength, train a linear regression model using the entire data set and the same options as when you cross-validated the models. Determine the number of nonzero coefficients per model.
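The following sketch trains on the full data set and counts the nonzero coefficients. It assumes `X` is already transposed so that observations are in columns, and that `Y` and `Lambda` are in the workspace. `Mdl` and `numNZCoeff` are hypothetical variable names.

```matlab
% Assumes X (1000-by-10000, observations in columns), Y, and Lambda exist.
Mdl = fitrlinear(X,Y,'ObservationsIn','columns','Lambda',Lambda, ...
    'Learner','leastsquares','Solver','sparsa','Regularization','lasso');
numNZCoeff = sum(Mdl.Beta~=0);   % Column sums: nonzero coefficients per value of Lambda
```

`Mdl.Beta` has one column per regularization strength, so summing the logical matrix down each column gives a 1-by-15 count.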

In the same figure, plot the cross-validated MSE and frequency of nonzero coefficients for each regularization strength. Plot all variables on the log scale.
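One possible way to draw both curves against a shared horizontal axis is `yyaxis`. This is a sketch, not necessarily the plotting approach of the original example. It assumes that `Lambda`, the cross-validated MSE vector `mse`, and the coefficient counts `numNZCoeff` (both hypothetical names) are in the workspace.

```matlab
% Assumes Lambda, mse, and numNZCoeff are 1-by-15 vectors in the workspace.
figure
yyaxis left
plot(log10(Lambda),log10(mse),'-o')
ylabel('log_{10} MSE')
yyaxis right
% log10(0) is -Inf, so models with no nonzero coefficients are not drawn.
plot(log10(Lambda),log10(numNZCoeff),'-o')
ylabel('log_{10} nonzero-coefficient frequency')
xlabel('log_{10} Lambda')
```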

Choose the index of the regularization strength that balances predictor variable sparsity and low MSE (for example, `Lambda(10)`).

Extract the model corresponding to the chosen regularization strength, and display its nonzero coefficients.
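A sketch of the extraction using `selectModels`. It assumes `Mdl` (a hypothetical name) is the `RegressionLinear` model trained on the entire data set for all 15 regularization strengths. `idxFinal` and `idxNZCoeff` are also hypothetical names.

```matlab
% Assumes Mdl is the full-data model trained for all values of Lambda.
idxFinal = 10;                           % Index of the chosen regularization strength
MdlFinal = selectModels(Mdl,idxFinal)    % Keep only the model for Lambda(idxFinal)
idxNZCoeff = find(MdlFinal.Beta~=0);     % Predictors with nonzero coefficients
EstCoeff = MdlFinal.Beta(idxNZCoeff)     % Their estimated coefficients
```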

MdlFinal =
RegressionLinear
ResponseName: 'Y'
ResponseTransform: 'none'
Beta: [1000x1 double]
Bias: -0.0050
Lambda: 0.0037
Learner: 'leastsquares'
Properties, Methods

EstCoeff = *2×1*
1.0051
1.9965

`MdlFinal` is a `RegressionLinear` model with one regularization strength. The nonzero coefficients `EstCoeff` are close to the coefficients used to simulate the data (1 and 2).