MATLAB Examples

Bayesian Lasso Regression

This example shows how to perform variable selection using Bayesian lasso regression.

Lasso regression is a linear regression technique that combines regularization and variable selection. Regularization helps prevent overfitting by shrinking the magnitudes of the regression coefficients. What sets lasso regression apart from other regularization techniques, such as ridge regression, is that lasso sets the regression coefficients of unimportant predictors to exactly 0.

Consider a simple linear regression model

$$y = \beta_0 1_n + X\beta + \varepsilon,$$

where $y$ is a vector of n responses, $X$ is an n-by-p matrix of observed values of p predictor variables, $\beta$ is a vector of p regression coefficients, $\beta_0$ is the intercept, $1_n$ is a length-n column vector of ones, and $\varepsilon$ is a random vector of iid Gaussian disturbances with mean 0 and variance $\sigma^2$. The objective function for lasso regression is

$$f\left( {\beta_0,\beta;y,X} \right) = 0.5\left\| {y - \beta_0 1_n - X\beta } \right\|_2^2 + \lambda \left\| \beta  \right\|_1$$

The penalty term $\lambda$ is not fit to the data like the other parameters. You must choose a value for it before estimation, and you might have to tune it. Given a value of $\lambda$, $f$ is minimized with respect to the intercept and regression coefficients. The resulting coefficients are the lasso estimates.
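To make the objective concrete, the following sketch evaluates $f$ for one candidate intercept and coefficient vector. The data and parameter values here are simulated for illustration only and are not part of the example.

rng(0);
nSim = 20; pSim = 3;
XSim = randn(nSim,pSim);                          % simulated predictors
ySim = 2 + XSim*[1; 0; -0.5] + 0.1*randn(nSim,1); % simulated responses
beta0Sim = 2; betaSim = [1; 0; -0.5]; lamSim = 0.5; % candidate parameters and penalty
fSim = 0.5*norm(ySim - beta0Sim - XSim*betaSim)^2 + lamSim*norm(betaSim,1)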

For more details on lasso regression, see docid:econ_ug.bvmwdog.

Load Data

Load the acetylene data set. Create a variable for all predictors.

load acetylene
X = [x1 x2 x3];

The sample size is 16, and there are 3 predictors. For more details, enter Description at the command line.
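To confirm these dimensions at the command line, you can check the size of the predictor matrix (this check is not part of the original example).

[n,p] = size(X)   % n = 16 observations, p = 3 predictors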

Fit Simple Linear Regression Model

Fit a simple linear regression model to the data. Extract and display the coefficients and the MSE.

SLRMdl = fitlm(X,y)
slrBeta = SLRMdl.Coefficients.Estimate
slrMSE = SLRMdl.MSE
SLRMdl = 


Linear regression model:
    y ~ 1 + x1 + x2 + x3

Estimated Coefficients:
                   Estimate       SE        tStat       pValue 
                   ________    ________    ________    ________

    (Intercept)    -121.27       55.436     -2.1876    0.049222
    x1             0.12685     0.042182      3.0073    0.010918
    x2             0.34816      0.17702      1.9668    0.072763
    x3             -19.022       107.93    -0.17624     0.86304


Number of observations: 16, Error degrees of freedom: 12
Root Mean Squared Error: 3.77
R-squared: 0.92,  Adjusted R-Squared 0.9
F-statistic vs. constant model: 45.9, p-value = 7.52e-07

slrBeta =

 -121.2696
    0.1269
    0.3482
  -19.0217


slrMSE =

   14.1908

SLRMdl is a LinearModel model object. An estimation summary appears at the command line. The t statistics suggest that, at the 5% level of significance, x1 is an important variable for prediction, but x2 and x3 might not be as important.

Fit Lasso Regression Model

Fit a lasso regression model to the data. lasso standardizes the data by default. In this example, turn off standardization by setting 'Standardize' to false.

[LassoBetaEstimates,FitInfo] = lasso(X,y,'Standardize',false);

By default, lasso fits the lasso regression model 100 times using a sequence of values for $\lambda$. Hence, LassoBetaEstimates is a 3-by-100 matrix of regression coefficient values. Rows correspond to predictor variables in X, and columns correspond to values of $\lambda$. FitInfo is a structure that includes the values of $\lambda$ (FitInfo.Lambda) and the mean squared error for each fit (FitInfo.MSE).
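As a quick sanity check (not part of the original example), you can inspect the dimensions of the outputs and the range of penalty values that lasso used.

size(LassoBetaEstimates)                   % 3-by-100: predictors by lambda values
[min(FitInfo.Lambda) max(FitInfo.Lambda)]  % range of the lambda sequence
size(FitInfo.MSE)                          % 1-by-100: one MSE per lambda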

Plot the magnitude of the regression coefficients with respect to their $L^1$ norm.

lassoPlot(LassoBetaEstimates);
legend('\beta_1','\beta_2','\beta_3','Location','NW');

Choose and display the lasso regression estimate corresponding to the smallest value of $\lambda$. Also, choose and display the corresponding MSE.

lassoBeta = LassoBetaEstimates(:,1);
lassoBeta = [FitInfo.Intercept(:,1); lassoBeta]
lassoMSE = FitInfo.MSE(:,1)
lassoBeta =

 -130.6923
    0.1340
    0.3481
         0


lassoMSE =

   10.6709

The lasso plot suggests that x3 is not important because $\beta_3$ is zero for all regularization values. It appears that x1 is the most important predictor because $\beta_1$ has a nonzero magnitude for most regularization values. The MSE from lasso regression is less than the MSE from simple linear regression.
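Instead of fixing $\lambda$ by hand, you can tune it by cross-validation. The following sketch is an optional extension, not part of the original analysis; with only 16 observations, the cross-validated MSE is noisy, so treat the selected penalty as illustrative.

rng(1); % For reproducibility of the cross-validation partition
[cvBetaEstimates,cvFitInfo] = lasso(X,y,'Standardize',false,'CV',5);
cvLambda = cvFitInfo.LambdaMinMSE            % penalty minimizing the CV MSE
cvBeta = [cvFitInfo.Intercept(cvFitInfo.IndexMinMSE); ...
    cvBetaEstimates(:,cvFitInfo.IndexMinMSE)]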

Fit Bayesian Linear Regression Model

A Bayesian view of lasso uses these assumptions:

  • $\sigma^2\sim IG(a,b)$, where $a$ and $b$ are hyperparameters.
  • $\beta_0\vert \sigma^2 \sim N(\mu_0,\sigma^2 v_0)$, where $v_0$ is a hyperparameter.
  • $\beta_j\vert\psi_j,\sigma^2\sim N(0,\psi_j\sigma^2)$ for $j = 1,...,p$. $\psi_j$ is a latent Gaussian scale parameter for each regression coefficient.
  • $\psi_j\sim Expo(2\lambda^{-2})$, where $\lambda$ is a constant.

Consequently,

  • $\beta_j\vert \sigma^2\sim Expo2(0,\sigma/\lambda)$, where $Expo2$ is the double exponential distribution.
  • $\psi_j^{-1}\vert \beta_0,\beta_j,\sigma^2,y,X\sim InvG(|\lambda\sigma/\beta_j|,\lambda^2)$, where $InvG$ is the inverse Gaussian distribution.

For more details, see docid:econ_ug.bvmwdr5.
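For intuition, this sketch draws one realization from the prior hierarchy above. It is not part of the original example: it uses, for concreteness, the hyperparameter values chosen later in this example, interprets $Expo(2\lambda^{-2})$ as an exponential distribution with mean $2\lambda^{-2}$, and uses the reciprocal of a gamma random variable for the inverse gamma parameterization assumed by priorLasso below.

aSim = 3; bSim = 1; mu0Sim = 0; v0Sim = 1e5; lamSim = 15; pSim = 3;
sigma2Sim = 1/gamrnd(aSim,bSim);                   % sigma^2 ~ IG(a,b)
beta0Sim  = normrnd(mu0Sim,sqrt(sigma2Sim*v0Sim)); % beta_0 | sigma^2
psiSim    = exprnd(2/lamSim^2,pSim,1);             % psi_j ~ Expo(2*lambda^(-2)), mean interpretation
betaSim   = normrnd(0,sqrt(psiSim*sigma2Sim))      % beta_j | psi_j, sigma^2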

You can estimate or sample from the posterior distribution by using one of these samplers: the slice, random walk Metropolis, Hamiltonian Monte Carlo (HMC), or Gibbs sampler.

For the HMC sampler, you must create a custom Bayesian linear regression model that specifies the prior distributions of $\beta_0$, $\beta_j\vert\sigma^2$, and $\sigma^2$. The estimate function sets up the HMC sampler for you, but you can tune its parameters if necessary.

You must set the Gibbs sampler up yourself.

Fit Using HMC Sampler

Declare the function priorLasso.m, which returns the log pdf of the joint prior distribution of the intercept, regression coefficients, and disturbance variance.

function logPDF = priorLasso(params,mu,v,a,b,lambda)
%priorLasso Log density of Bayesian lasso regression prior distributions 
%   priorLasso assumes:
%
%       * params(1) is the intercept and is normally distributed with a mean
%       of mu and scaled variance v.
%
%       * params(2:end - 1) are the regression coefficients and
%       individually have double exponential distributions with location
%       parameter 0 and scale sqrt(params(end))/lambda.
%
%       * params(end) is the disturbance variance and has an inverse gamma
%       distribution with shape a and scale b.
%
%   priorLasso returns the log of the product of all densities.
%   
%   params: Parameter values at which the densities are evaluated, an
%           m-by-1 numeric vector.
%
%       mu: Center of the normal distribution for the intercept, a numeric
%           scalar.
%
%        v: Scale of normal distribution for the intercept, a positive
%           numeric scalar.
% 
%        a: Inverse gamma shape parameter, a positive numeric scalar.
%
%        b: Inverse gamma scale parameter, a positive scalar.
%
%   lambda: Lasso penalty term, a nonnegative numeric scalar.

sigma2PDF = params(end)^(-a-1)*exp(-1/(params(end)*b))/(gamma(a)*b^a);
interceptPDF = normpdf(params(1),mu,sqrt(params(end)*v));
betaPDF = lambda/(2*sqrt(params(end)))*exp(-lambda*abs(params(2:end-1))/sqrt(params(end)));
logPDF = log(sigma2PDF) + log(interceptPDF) + sum(log(betaPDF));
end


You must set values for the hyperparameters. Set mu to 0, v to 1e5, a to 3, b to 1, and $\lambda$ to 15. Create an anonymous function that passes the hyperparameter values to priorLasso, but leaves params as an argument.

mu = 0;
v = 1e5;
a = 3;
b = 1;
lambda = 15;

logPDF = @(params)priorLasso(params,mu,v,a,b,lambda);

Ideally, you should use a range of values for $\lambda$. For simplicity, this example uses one value representing a high penalty.
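If you do want to study sensitivity to the penalty, one approach (sketched here, not run in this example) is to build a prior handle for each value in a grid of $\lambda$ values and repeat the estimation for each handle.

lambdaGrid = [5 15 50];                           % illustrative grid of penalties
logPDFGrid = arrayfun(@(l) @(params)priorLasso(params,mu,v,a,b,l), ...
    lambdaGrid,'UniformOutput',false);            % one log prior handle per lambda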

Create a Bayesian linear regression model with the custom prior. Specify that there are three predictor variables.

p = 3;
PriorMdl = bayeslm(p,'ModelType','custom','LogPDF',logPDF);

PriorMdl is a customblm model object.

Estimate the posterior distribution and return estimates. Specify the HMC sampler, reparameterization of the disturbance variance, and 5000 draws.

rng(1); % For reproducibility
[EstMdl,customBLMBeta,~,customBLMMSE] = estimate(PriorMdl,X,y,...
    'Sampler',"hmc",'NumDraws',5000,'Reparameterize',true);

customBLMBeta
customBLMMSE
Method: MCMC sampling with 5000 draws
Number of observations: 16
Number of predictors:   4
 
           |    Mean      Std            CI95          Positive  Distribution 
------------------------------------------------------------------------------
 Intercept | -130.3866  14.4624  [-158.817, -101.690]    0.000     Empirical  
 Beta(1)   |    0.1348   0.0122    [ 0.110,  0.159]      1.000     Empirical  
 Beta(2)   |    0.2470   0.1650    [-0.056,  0.584]      0.942     Empirical  
 Beta(3)   |    0.0022   0.3881    [-0.774,  0.782]      0.509     Empirical  
 Sigma2    |   14.3424   6.1225    [ 6.691, 29.655]      1.000     Empirical  
 

customBLMBeta =

 -130.3866
    0.1348
    0.2470
    0.0022


customBLMMSE =

   14.3424

EstMdl is an empiricalblm model object that contains the HMC sample draws from the posterior distribution. The penalty shrinks $\beta_3$ toward zero, while the estimates of the other coefficients are close to the lasso estimates.
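To quantify how strongly the posterior concentrates $\beta_3$ near zero, you can compute the proportion of draws with small magnitude. The 0.1 threshold below is an arbitrary illustrative choice, not part of the original example.

beta3Draws = EstMdl.BetaDraws(4,:);   % row 4 corresponds to Beta(3)
proportionNearZero = mean(abs(beta3Draws) < 0.1)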

Plot the traces of the HMC sampler draws.

figure;
for j = 1:(p + 1)
    subplot(2,2,j)
    plot(EstMdl.BetaDraws(j,:))
    title(sprintf('%s',EstMdl.VarNames{j}));
    xlabel('Simulation index');
    ylabel('HMC sampler draw')
end

figure;
plot(EstMdl.Sigma2Draws)
title('Sigma2');
xlabel('Simulation index');
ylabel('HMC sampler draw');

The trace plots indicate adequate mixing of the MCMC sample.
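As an additional diagnostic (a sketch, not part of the original example), you can inspect the sample autocorrelation of one chain; low autocorrelation at moderate lags supports the visual assessment of mixing.

figure;
autocorr(EstMdl.BetaDraws(2,:)');   % sample ACF of the Beta(1) chain
title('Sample autocorrelation of Beta(1) draws');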

Fit Using Gibbs Sampler

Preallocate a 4-by-10000 matrix of zeros, a 1-by-10000 vector of zeros, and a 3-by-10000 matrix of ones for the posterior draws of the regression coefficients, disturbance variance, and latent scales.

m = 10000;
GibbsBetaDraws = zeros(p+1,m);
gibbsSigma2Draws = zeros(1,m);
GibbsPsiDraws = ones(p,m);
psi = GibbsPsiDraws(:,1);

Implement the Gibbs sampler by following these steps for each draw.

  1. Create a conjugate prior model for $\beta_0,\beta_j,\sigma^2\vert\psi_j$.
  2. Sample from the posterior of $\beta_0,\beta_j,\sigma^2\vert\psi_j,y,X$ using the current value of $\psi_j$ for each $j = 1,...,p$.
  3. Sample from $\psi_j\vert \beta_0,\beta_j,\sigma^2,y,X$ using the current values of $\beta_0$, $\beta_j$, and $\sigma^2$.

for k = 1:m
    muGibbs = [mu; 0; 0; 0];
    VGibbs = diag([v; psi]);
    TmpMdl = bayeslm(p,'ModelType','conjugate','Mu',muGibbs,'V',VGibbs,...
        'A',a,'B',b);
    [GibbsBetaDraws(:,k),gibbsSigma2Draws(k)] = simulate(TmpMdl,X,y);
    muIGPsi = abs(lambda*sqrt(gibbsSigma2Draws(k))./GibbsBetaDraws(2:end,k));
    vIGPsi = lambda^2;
    GibbsPsiDraws(:,k) = 1./random('inversegaussian',muIGPsi,vIGPsi);
    psi = GibbsPsiDraws(:,k);
end

Plot the traces of the Gibbs sample draws.

figure;
for j = 1:(p + 1)
    subplot(2,2,j)
    plot(GibbsBetaDraws(j,:))
    title(sprintf('%s',EstMdl.VarNames{j}));
    xlabel('Simulation index');
    ylabel('Gibbs draw');
end

figure;
plot(gibbsSigma2Draws)
title('Sigma2');
xlabel('Simulation index');
ylabel('Gibbs draw');

figure;
for j = 1:p
    subplot(2,2,j)
    plot(GibbsPsiDraws(j,:))
    title(sprintf('psi %d',j));
    xlabel('Simulation index');
    ylabel('Gibbs draw')
end

The trace plots indicate that the MCMC sample is mixing well.

Estimate the posterior means of the regression coefficients and the disturbance variance (MSE). Treat the first 5000 draws as a burn-in sample.

gibbsBeta = mean(GibbsBetaDraws(:,5001:end),2)
gibbsMSE = mean(gibbsSigma2Draws(5001:end))
gibbsBeta =

 -130.8014
    0.1350
    0.2553
    0.0004


gibbsMSE =

    9.7098

The results from Gibbs sampling are similar to the results using the custom Bayesian linear regression model.

The Bayesian linear regression estimates are fairly close to those from lasso. Discrepancies come from two sources:

  • The Bayesian estimates are the mean of the posterior distributions, rather than the mode.
  • The Bayesian model setup implies that $\lambda_1 = \lambda_2\frac{\sigma}{n}$, where $\lambda_1$ is the penalty for the frequentist method and $\lambda_2$ is the penalty for the Bayesian method.

Plot histograms of the posterior draws after the burn-in period.

figure;
for j = 1:(p + 1)
    subplot(2,2,j)
    histogram(GibbsBetaDraws(j,5001:end),'Normalization','pdf')
    title(sprintf('%s',EstMdl.VarNames{j}));
end

The posterior distributions of the coefficients are approximately bell shaped, so the posterior mean is a good approximation of the mode.
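To support this observation, you can compare the posterior mean of a coefficient with a kernel-density estimate of its mode. This check is a sketch, not part of the original example.

beta1Draws = GibbsBetaDraws(2,5001:end);   % Beta(1) draws after burn-in
[fHat,xGrid] = ksdensity(beta1Draws);      % kernel density estimate
[~,idx] = max(fHat);
beta1PostMode = xGrid(idx)
beta1PostMean = mean(beta1Draws)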

Compute the corresponding value of $\lambda_1$, and then implement lasso regression using $\lambda_1$. For $\sigma$, use a reasonable estimate, such as the square root of the posterior mean of $\sigma^2$ from the Gibbs sample.

lambda1 = lambda*sqrt(gibbsMSE)/numel(y);
[lassoBeta2,FitInfo2] = lasso(X,y,'Lambda',lambda1,'Standardize',false);
lassoBeta2 = [FitInfo2.Intercept(:,1); lassoBeta2]
lassoMSE2 = FitInfo2.MSE(:,1)
lassoBeta2 =

 -130.7732
    0.1351
    0.2504
         0


lassoMSE2 =

   10.9617

Display all estimates in a table for comparison.

table([slrBeta; slrMSE],[lassoBeta2; lassoMSE2],[customBLMBeta; customBLMMSE],...
    [gibbsBeta; gibbsMSE],'VariableNames',{'SLR' 'Lasso' 'CustomBLM' 'Gibbs'},...
    'RowNames',[EstMdl.VarNames; 'MSE'])
ans =

  5x4 table

                   SLR       Lasso     CustomBLM     Gibbs 
                 _______    _______    _________    _______

    Intercept    -121.27    -130.77      -130.39     -130.8
    Beta(1)      0.12685    0.13506      0.13475    0.13504
    Beta(2)      0.34816    0.25035      0.24703    0.25532
    Beta(3)      -19.022          0    0.0021912     0.0004
    MSE           14.191     10.962       14.342     9.7098