MATLAB Examples

Implement Bayesian Variable Selection

This example shows how to implement stochastic search variable selection (SSVS), a Bayesian variable selection technique. Also, the example compares the performance of several supported Markov chain Monte Carlo (MCMC) samplers.

Introduction

Consider this Bayesian linear regression model.

$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \varepsilon_t.$$

  • $\beta_j|\sigma^2 \sim N(\mu_j,\sigma^2V_j)$.
  • $\varepsilon_t \sim N(0,\sigma^2)$.
  • $\sigma^2 \sim IG(a,b)$, where $IG(a,b)$ is the inverse gamma distribution with shape a and scale b.

The goal of variable selection is to include in the final regression model only those predictors that the data support. One way to do this is to analyze the $2^p$ candidate models, where the models differ by which coefficients they include. You first fit each model to the data, and then compare the models by using performance measures such as goodness of fit or forecast error.

From a Bayesian viewpoint, excluding a coefficient from a model corresponds to assigning it a degenerate prior distribution, that is, a Dirac delta distribution with all of its probability mass concentrated at zero. To circumvent the complexity induced by degenerate variates, SSVS instead assigns an excluded coefficient a normal prior with a small variance, for example $N(0,0.1^2)$. Because this prior is concentrated around zero, the corresponding posterior must also be concentrated around zero. The prior for an included coefficient can be $N(\mu,V)$, where both $\mu$ and $V$ are sufficiently far from zero.
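
To visualize the two kinds of priors, you can plot a concentrated ("spike") density and a diffuse ("slab") density on the same axes. This is a minimal illustrative sketch that conditions on $\sigma^2 = 1$ and uses the variances 0.01 and 100, the values chosen later in this example.

betaGrid = linspace(-10,10,1001)';
spikePrior = normpdf(betaGrid,0,sqrt(0.01));   % Prior density for an excluded coefficient
slabPrior = normpdf(betaGrid,0,sqrt(100));     % Prior density for an included coefficient
figure;
plot(betaGrid,[spikePrior slabPrior]);
legend('Excluded: N(0,0.01)','Included: N(0,100)');
xlabel('\beta_j');
ylabel('Prior density');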

Consider the latent binary random variables $\gamma_0$, $\gamma_1$, and $\gamma_2$, where $\gamma_j = 1$ indicates that $\beta_j\sim N(\mu_j,\sigma^2V_j)$ and $\gamma_j = 0$ indicates that $\beta_j\sim N(0,0.01\sigma^2)$. The sample space of $\gamma = (\gamma_0, \gamma_1, \gamma_2)$ covers all eight candidate models in the Bayesian variable selection scheme. Given the data, the most probable value of $\gamma$ corresponds to the desired Bayesian linear regression model. To include $\gamma$ in the model, you can marginalize it out by summing over the probabilities of the values in its sample space. If $\beta = (\beta_0,\beta_1,\beta_2)$, then the joint prior probability distribution is this conjugate mixture distribution.

  • $f_{\beta|\sigma^2}(\beta;\sigma^2) = \sum_{i,j,k} P\left(\gamma = (i,j,k)\right) f(\beta;\mu_{ijk},\sigma^2 V_{ijk})$.
  • $\sigma^2\sim IG(a,b)$.

$f(\beta;\mu_{ijk},\sigma^2 V_{ijk})$ is the PDF of the multivariate normal distribution with mean vector $\mu_{ijk}$ and diagonal covariance matrix $\sigma^2 V_{ijk}$. $\mu_{ijk}$ and $V_{ijk}$ consist of the means and variances of $\beta$ under $\gamma = (i,j,k)$.

The posterior distribution of $\beta,\sigma^2|y$ is analytically tractable: it is a mixture of normal-inverse-gamma distributions, with one component for each value of $\gamma$.

Choose Hyperparameter Values

Prior hyperparameters include $\mu_m$, $V_m$, $a$, $b$, and $P(\gamma=(i,j,k))$, where $m$ = 0, 1, 2. Although you should conduct a sensitivity analysis, for now, you can arbitrarily choose values for the prior means, variance scales, shape, and scale parameters. However, when you choose priors for the probabilities of $\gamma_m$, consider their implications on the final model. For example, if your model needs an intercept, then

$$P(\gamma_0 = 1) = \sum_{j,k}P(\gamma = (1,j,k)) = 1.$$

Choose hyperparameter values so that the distribution of the regression coefficients is centered around zero and:

  • For regression coefficients that are in the model, the distribution is diffuse (large variance).
  • For regression coefficients that are not in the model, the distribution is concentrated (small variance).

Choose arbitrary values for the inverse gamma parameters and $P(\gamma_m = 1)$, $m$ = 0, 1, 2.

numCoeffs = 3;                   % Number of coefficients, including the intercept
mu = zeros(numCoeffs,1);         % Prior coefficient means
vin = 100*ones(numCoeffs,1);     % Prior variance scales for coefficients in the model
vout = 0.01*ones(numCoeffs,1);   % Prior variance scales for coefficients not in the model
a = 3;                           % Inverse gamma shape
b = 1/3;                         % Inverse gamma scale
pGamma = [0.8; 0.3; 0.1];        % Prior probabilities P(gamma_m = 1)
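
These choices imply a prior probability for each of the eight candidate models. As a quick check, you can enumerate the values of $\gamma$ and compute the corresponding prior model probabilities, which must sum to 1. This is a short sketch; the variable names gammaGrid and priorModelProb are illustrative.

gammaGrid = dec2bin(0:2^numCoeffs - 1) - '0';  % Rows are values of (gamma_0,gamma_1,gamma_2)
priorModelProb = prod(gammaGrid.*pGamma' + (1 - gammaGrid).*(1 - pGamma'),2);
disp([gammaGrid priorModelProb]);              % Gamma values and prior model probabilities
sum(priorModelProb)                            % The probabilities sum to 1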

Simulate Data

Simulate 1000 observations from the regression model

$$y_t = 1 + 2x_{t1} + 0.1x_{t2} + \varepsilon_t,$$

where

$$\left[ \begin{array}{c} x_{t1}\\ x_{t2} \end{array} \right] \sim N_2\left( \left[ \begin{array}{c} 1\\ 2 \end{array} \right], \left[ \begin{array}{cc} 2 & 1\\ 1 & 2 \end{array} \right] \right)$$

and $\varepsilon_t\sim N(0,25)$.

rng('default'); % For reproducibility
numObs = 1000;
Beta0True = 1;
Beta1True = 2;
Beta2True = 0.1;
muX = [1 2];
SigmaX = eye(2)+ones(2);
X = mvnrnd(muX,SigmaX,numObs);
disturbance = 5*randn(numObs,1);
y = Beta0True + X*[Beta1True; Beta2True] + disturbance;
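
As a quick sanity check on the simulated data (not part of the Bayesian analysis), compute the ordinary least squares estimates; they should be near the true coefficient values.

betaOLS = [ones(numObs,1) X]\y   % Should be near [1; 2; 0.1]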

Joint Prior Distribution

Declare a function for the log of the joint prior distribution of $\beta$ and $\sigma^2$, marginalized over $\gamma$. Write the function so that:

  • The first input argument is a vector representing the values of the random coefficients and disturbance variance, $[\beta^\prime \sigma^2]^\prime$.
  • Other input arguments are values of the hyperparameters in any order.
  • The first output argument is the log of the joint prior density.
  • The second output argument is the gradient of the log of the joint prior density with respect to the coefficients and disturbance variance. Only the Hamiltonian Monte Carlo (HMC) sampler uses this output, and the output is optional. You can specify NaN for any partial derivative that is difficult to compute, and MATLAB® computes that partial derivative numerically instead. However, supplying analytical gradients improves the performance of the HMC sampler.
function [logPrior,gradient] = logPDFBVS(params,mu,vin,vout,pGamma,a,b)
%logPDFBVS Log joint prior for Bayesian variable selection
%   logPDFBVS is the log of the joint prior density of a
%   normal-inverse-gamma mixture conjugate model for a Bayesian linear
%   regression model with numCoeffs coefficients.  logPDFBVS passes
%   params(1:end-1), the coefficients, to the PDF of a mixture of normal
%   distributions with hyperparameters mu, vin, vout, and pGamma, and also
%   passes params(end), the disturbance variance, to an inverse gamma
%   density with shape a and scale b.
%
%   params: Parameter values at which the densities are evaluated, a
%           (numCoeffs + 1)-by-1 numeric vector.  The first numCoeffs
%           elements correspond to the regression coefficients and the last
%           element corresponds to the disturbance variance.
%
%       mu: Multivariate normal component means, a numCoeffs-by-1 numeric
%       vector of prior means for the regression coefficients.
%
%      vin: Multivariate normal component scales, a numCoeffs-by-1 vector
%           of positive values for the regression coefficients that are
%           likely in the model.
%
%     vout: Multivariate normal component scales, a numCoeffs-by-1 vector
%           of positive values for the regression coefficients that are
%           likely not in the model.
%
%   pGamma: Prior probability that regression coefficients are in the
%           model, a numCoeffs-by-1 vector of positive values from 0
%           through 1.
%
%        a: Inverse gamma shape parameter, a positive numeric scalar.
%
%        b: Inverse gamma scale parameter, a positive scalar.
% 
Beta = params(1:end-1);
sigma2 = params(end);

priorSigma2 = -(a+1)*log(sigma2) - 1/(b*sigma2);

priorBeta0in = pGamma(1)*normpdf(Beta(1),mu(1),sqrt(sigma2*vin(1)));
priorBeta0out = (1 - pGamma(1))*normpdf(Beta(1),mu(1),sqrt(sigma2*vout(1)));
priorBeta0 = log(priorBeta0in + priorBeta0out);

priorBeta1in = pGamma(2)*normpdf(Beta(2),mu(2),sqrt(sigma2*vin(2)));
priorBeta1out = (1 - pGamma(2))*normpdf(Beta(2),mu(2),sqrt(sigma2*vout(2)));
priorBeta1 = log(priorBeta1in + priorBeta1out);

priorBeta2in = pGamma(3)*normpdf(Beta(3),mu(3),sqrt(sigma2*vin(3)));
priorBeta2out = (1 - pGamma(3))*normpdf(Beta(3),mu(3),sqrt(sigma2*vout(3)));
priorBeta2 = log(priorBeta2in + priorBeta2out);

logPrior = priorSigma2 + priorBeta0 + priorBeta1 + priorBeta2;

gradPriorBeta0 = 1/exp(priorBeta0)*...
    (pGamma(1)*normpdf(Beta(1),mu(1),sqrt(sigma2*vin(1)))*...
    (-(Beta(1) - mu(1))/(sigma2*vin(1))) + ...
    ((1 - pGamma(1))*normpdf(Beta(1),mu(1),sqrt(sigma2*vout(1)))*...
    (-(Beta(1) - mu(1))/(sigma2*vout(1)))));

gradPriorBeta1 = 1/exp(priorBeta1)*...
    (pGamma(2)*normpdf(Beta(2),mu(2),sqrt(sigma2*vin(2)))*...
    (-(Beta(2) - mu(2))/(sigma2*vin(2))) + ...
    ((1 - pGamma(2))*normpdf(Beta(2),mu(2),sqrt(sigma2*vout(2)))*...
    (-(Beta(2) - mu(2))/(sigma2*vout(2)))));
    
gradPriorBeta2 = 1/exp(priorBeta2)*...
    (pGamma(3)*normpdf(Beta(3),mu(3),sqrt(sigma2*vin(3)))*...
    (-(Beta(3) - mu(3))/(sigma2*vin(3))) + ...
    ((1 - pGamma(3))*normpdf(Beta(3),mu(3),sqrt(sigma2*vout(3)))*...
    (-(Beta(3) - mu(3))/(sigma2*vout(3)))));

gradPriorSigmaBeta0 = 1/exp(priorBeta0)*(...
    pGamma(1)*normpdf(Beta(1),mu(1),sqrt(sigma2*vin(1)))*...
    (0.5*(Beta(1) - mu(1))^2/vin(1)/sigma2^2 - 0.5/sigma2) +...
    (1 - pGamma(1))*normpdf(Beta(1),mu(1),sqrt(sigma2*vout(1)))*...
    (0.5*(Beta(1) - mu(1))^2/vout(1)/sigma2^2 - 0.5/sigma2));

gradPriorSigmaBeta1 = 1/exp(priorBeta1)*(...
    pGamma(2)*normpdf(Beta(2),mu(2),sqrt(sigma2*vin(2)))*...
    (0.5*(Beta(2) - mu(2))^2/vin(2)/sigma2^2 - 0.5/sigma2) +...
    (1 - pGamma(2))*normpdf(Beta(2),mu(2),sqrt(sigma2*vout(2)))*...
    (0.5*(Beta(2) - mu(2))^2/vout(2)/sigma2^2 - 0.5/sigma2));

gradPriorSigmaBeta2 = 1/exp(priorBeta2)*(...
    pGamma(3)*normpdf(Beta(3),mu(3),sqrt(sigma2*vin(3)))*...
    (0.5*(Beta(3) - mu(3))^2/vin(3)/sigma2^2 - 0.5/sigma2) +...
    (1 - pGamma(3))*normpdf(Beta(3),mu(3),sqrt(sigma2*vout(3)))*...
    (0.5*(Beta(3) - mu(3))^2/vout(3)/sigma2^2 - 0.5/sigma2));

gradPriorSigma2 = -(a+1)/sigma2 + 1/(b*sigma2^2) + gradPriorSigmaBeta0 + ...
    gradPriorSigmaBeta1 + gradPriorSigmaBeta2;

gradient = [gradPriorBeta0; gradPriorBeta1; gradPriorBeta2; gradPriorSigma2];
end
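
Optionally, verify the analytical gradient by comparing it with a central finite-difference approximation at a test point. This sketch assumes that logPDFBVS is available, for example as a function file on the path or as a local function at the end of your script; the test point params0 is arbitrary.

params0 = [0.5; 1.5; 0.2; 20];    % Arbitrary test point: [beta; sigma2]
[~,analyticGrad] = logPDFBVS(params0,mu,vin,vout,pGamma,a,b);
h = 1e-6;                         % Finite-difference step
numericGrad = zeros(size(params0));
for m = 1:numel(params0)
    e = zeros(size(params0));
    e(m) = h;
    numericGrad(m) = (logPDFBVS(params0 + e,mu,vin,vout,pGamma,a,b) - ...
        logPDFBVS(params0 - e,mu,vin,vout,pGamma,a,b))/(2*h);
end
max(abs(analyticGrad - numericGrad))   % Should be close to zero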

Create Bayesian Linear Regression Model

Create an anonymous function from the log of the joint prior distribution. The only input argument of the function is a vector containing the coefficients and disturbance variance.

logPDF = @(params)logPDFBVS(params,mu,vin,vout,pGamma,a,b);

Create a Bayesian linear regression model containing three coefficients, one of which is an intercept, using the custom joint prior distribution represented by logPDF.

Mdl = bayeslm(numCoeffs - 1,'ModelType','custom','LogPDF',logPDF);

Mdl is a customblm model object.
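
To confirm that the custom prior is wired up correctly, evaluate logPDF at an arbitrary parameter value; the result should be a finite log density. The test values here are arbitrary.

logPDF([0; 0; 0; 1])   % Log joint prior at beta = [0;0;0] and sigma2 = 1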

Specify Sampler Options

Bayesian linear regression models with custom priors treat the posterior as analytically intractable. Therefore, MCMC sampling is required to explore the posterior distribution. MATLAB® supports these samplers.

  • Random walk Metropolis sampler with multivariate normal or t proposal distributions
  • Slice sampler
  • Hamiltonian Monte Carlo (HMC) sampler

Each sampler has a set of hyperparameters that you can tune.

To compare the performance of the three samplers, create a sampler options structure for each sampler.

  • For the slice sampler, use the default typical width.
  • For the random walk Metropolis sampler, specify a multivariate t proposal distribution with three degrees of freedom.
  • For the HMC sampler, use the default options.

optionsS = sampleroptions('Sampler','slice');
optionsRWM = sampleroptions('Sampler','metropolis','Distribution','mvt',...
    'DegreeOfFreedom',3);
optionsHMC = sampleroptions('Sampler','hmc');

Estimate Posterior Distributions

Estimate the posterior distributions using each sampler. Specify 2000 draws, a burn-in sample of 1000, and, for numerical stability, reparameterization of the disturbance variance. Store the results in a cell array.

numDraws = 2000;
burnIn = 1000;
EstMdl = cell(3,1); % Preallocation
estBeta = zeros(numCoeffs,3);
options = {optionsS optionsRWM optionsHMC};
samplerNames = cellfun(@(x)x.Sampler,options,'UniformOutput',false);

for j = 1:3
    [EstMdl{j},estBeta(:,j)] = estimate(Mdl,X,y,'Options',options{j},...
        'NumDraws',numDraws,'BurnIn',burnIn,'Reparameterize',true);
end
Method: MCMC sampling with 2000 draws
Number of observations: 1000
Number of predictors:   3
 
           |   Mean     Std         CI95        Positive  Distribution 
-----------------------------------------------------------------------
 Intercept |  0.6011  0.2106  [ 0.202,  1.005]    0.999     Empirical  
 Beta(1)   |  1.9729  0.1251  [ 1.727,  2.198]    1.000     Empirical  
 Beta(2)   |  0.3699  0.0972  [ 0.191,  0.570]    1.000     Empirical  
 Sigma2    | 25.1563  1.1442  [22.892, 27.303]    1.000     Empirical  
 

Method: MCMC sampling with 2000 draws
Number of observations: 1000
Number of predictors:   3
 
           |   Mean     Std         CI95        Positive  Distribution 
-----------------------------------------------------------------------
 Intercept |  0.6408  0.2543  [ 0.143,  1.186]    0.995     Empirical  
 Beta(1)   |  1.9977  0.1243  [ 1.729,  2.221]    1.000     Empirical  
 Beta(2)   |  0.3464  0.1157  [ 0.132,  0.570]    0.999     Empirical  
 Sigma2    | 25.2375  1.1399  [23.021, 27.788]    1.000     Empirical  
 

Method: MCMC sampling with 2000 draws
Number of observations: 1000
Number of predictors:   3
 
           |   Mean     Std         CI95        Positive  Distribution 
-----------------------------------------------------------------------
 Intercept |  0.6388  0.2580  [ 0.133,  1.150]    0.992     Empirical  
 Beta(1)   |  1.9875  0.1358  [ 1.710,  2.251]    1.000     Empirical  
 Beta(2)   |  0.3543  0.1266  [ 0.116,  0.612]    0.996     Empirical  
 Sigma2    | 25.1477  1.0994  [23.046, 27.322]    1.000     Empirical  
 

EstMdl is a 3-by-1 cell array, and each cell contains an estimated Bayesian linear regression model (empiricalblm model object) derived from the corresponding MCMC sampler. estBeta is a 3-by-3 matrix; each column contains the posterior means of the three coefficients estimated by the corresponding sampler.

The estimation displays at the command line show that the estimates are close to each other and fairly close to their true values.

MCMC Sampler Diagnostics

For each sampler and parameter, view trace plots for the MCMC draws.

for k = 1:3
    figure;
    for j = 1:3
        subplot(3,1,j);
        plot(EstMdl{j}.BetaDraws(k,:));
        title(sprintf('%s -- %s',samplerNames{j},EstMdl{j}.VarNames{k}));
    end
end

figure;
for j = 1:3
    subplot(3,1,j);
    plot(EstMdl{j}.Sigma2Draws);
    title(sprintf('%s -- Sigma2',samplerNames{j}));
end

The trace plots indicate that the Metropolis and slice samples have high autocorrelation and are mixing slowly. However, the HMC sample is mixing well.
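
To quantify the mixing behavior that the trace plots suggest, you can also inspect the sample autocorrelation of the draws. This short sketch plots the autocorrelation function of the Beta(2) draws for each sampler using autocorr (Econometrics Toolbox).

figure;
for j = 1:3
    subplot(3,1,j);
    autocorr(EstMdl{j}.BetaDraws(3,:)');   % Beta(2) is the third coefficient
    title(sprintf('%s -- Beta(2) sample ACF',samplerNames{j}));
end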

Compare Analytic Solution and Simulation Results

Given a particular value of $\gamma$, the Bayesian linear regression model is a normal-inverse-gamma conjugate model. Compute the analytic posterior distribution for each of the eight models. Turn off all estimation displays. Store the marginal likelihood value for each model.

scales = [0.01 100];
[g{numCoeffs:-1:1}] = ndgrid(1:numel(scales));
Idx = reshape(cat(numCoeffs+1,g{:}),[],numCoeffs);
V = scales(Idx); % Permuted scales

estBetaConjugate = zeros(size(V,1),numCoeffs);
logL = zeros(size(V,1),1);

for j = 1:size(V,1)
    MdlConjugate = bayeslm(numCoeffs - 1,'ModelType','conjugate',...
        'Mu',mu,'V',diag(V(j,:)),'A',a,'B',b);
    [~,estBetaConjugate(j,:)] = estimate(MdlConjugate,X,y,...
        'Display',false);
    logL(j) = marginY(MdlConjugate,X,y);
end

estBetaConjugate is an 8-by-3 numeric matrix. Rows correspond to estimated coefficients for models given particular values of $\gamma$. For example, estBetaConjugate(1,:) contains the estimated coefficients for the model in which all coefficients have a prior scale of 0.01.

Compute the posterior distribution marginalized over $\gamma$.

PIdx = abs(Idx - 2);
pGammaMat = PIdx + (-1).^PIdx.*pGamma'; % Permuted gamma probabilities

% Unnormalized log posterior model probabilities: log prior + log marginal likelihood
postProb = sum(log(pGammaMat),2) + logL;

% Normalize
maxProb = max(postProb);
postProb = exp(postProb - maxProb)/sum(exp(postProb - maxProb));

% Average of betas weighted by model probabilities
estBetaConjugate = sum(estBetaConjugate.*postProb)'
estBetaConjugate =

    0.6321
    1.9910
    0.3538

estBetaConjugate contains the analytic estimates of the coefficients in the normal-inverse-gamma mixture model, obtained by averaging the conjugate-model estimates weighted by the posterior model probabilities.

Compare estimated coefficients for all methods.

est = [[Beta0True; Beta1True; Beta2True] estBetaConjugate estBeta];
array2table(est,'VariableNames',{'True' 'Analytic' samplerNames{:}},...
    'RowNames',EstMdl{1}.VarNames)
ans =

  3x5 table

                 True    Analytic     Slice     Metropolis      HMC  
                 ____    ________    _______    __________    _______

    Intercept      1     0.63212     0.60109    0.64085       0.63875
    Beta(1)        2       1.991      1.9729     1.9977        1.9875
    Beta(2)      0.1     0.35379     0.36987    0.34643       0.35435

The analytic estimates are fairly close to those from MCMC sampling.

Estimate Model Indicator

The model indicator $\gamma_j$ indicates whether coefficient $j$ is in the model. Its posterior mean, the posterior inclusion probability $P(\gamma_j = 1|y)$, measures how strongly the data support including the coefficient in the model. Estimate the posterior inclusion probabilities for all methods.

ProbCoeffIn = zeros(numCoeffs,4); % numCoeffs-by-numMethods

% Monte Carlo estimation for samplers
for i = 1:numCoeffs
    for j = 1:numel(samplerNames)
        pBetaIn = pGamma(i)*normpdf(EstMdl{j}.BetaDraws(i,:),mu(i),...
            sqrt(EstMdl{j}.Sigma2Draws.*vin(i)));
        pBetaOut = (1 - pGamma(i))*normpdf(EstMdl{j}.BetaDraws(i,:),mu(i),...
            sqrt(EstMdl{j}.Sigma2Draws.*vout(i)));
        gam = pBetaIn./(pBetaIn + pBetaOut) > rand(1,numDraws);
        ProbCoeffIn(i,j + 1) = mean(gam);
    end
end

% Analytical estimates
for j = 1:numCoeffs
    ProbCoeffIn(j,1) = sum(postProb.*(1 - PIdx(:,j)));
end

Compare the model indicators among the methods.

array2table(ProbCoeffIn,'VariableNames',{'Analytic' samplerNames{:}},...
    'RowNames',EstMdl{1}.VarNames)
ans =

  3x4 table

                 Analytic     Slice     Metropolis     HMC  
                 _________    ______    __________    ______

    Intercept      0.10725    0.0945    0.1035        0.1185
    Beta(1)        0.88582     0.885    0.8925         0.872
    Beta(2)      0.0014898    0.0015     0.002         0.001

The values are close to each other.

All methods suggest removing $x_{t2}$ from the model (that is, $\gamma_2 = 0$) because the posterior inclusion probability of $\beta_2$ is below 0.01.

In this case, the slice, Metropolis, and HMC samplers yield equivalent results. However, the samples drawn by the HMC sampler are of much higher quality than those drawn by the other samplers. Drawing inferences from MCMC samples that do not mix well or that have not reached their stationary distribution can lead to incorrect conclusions. You can improve the quality of the samples by tuning the sampler parameters, thinning the sample, increasing the number of draws, or reparameterizing the model, as sketched below.
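
For example, this sketch re-estimates the model with the random walk Metropolis sampler using more draws and a longer burn-in. The draw and burn-in sizes are illustrative, and the thinning option is noted only as a comment because support for it depends on your release.

[EstMdlRWM2,estBetaRWM2] = estimate(Mdl,X,y,'Options',optionsRWM,...
    'NumDraws',10000,'BurnIn',5000,'Reparameterize',true);
% If your release supports it, add 'Thin',5 to retain every fifth draw.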