Implement Bayesian Variable Selection
This example shows how to implement stochastic search variable selection (SSVS), a Bayesian variable selection technique. Also, the example compares the performance of several supported Markov chain Monte Carlo (MCMC) samplers.
Introduction
Consider this Bayesian linear regression model.
- y_t = β0 + β1·x_t1 + β2·x_t2 + ε_t.
- ε_t|σ² ~ N(0, σ²), independently across observations.
- σ² ~ IG(a, b), where IG(a, b) is the inverse gamma distribution with shape a and scale b.
The goal of variable selection is to include only those predictors supported by data in the final regression model. One way to do this is to analyze the permutations of models, where models differ by the coefficients that are included (with three coefficients, 2³ = 8 models). You first fit the models to data and then compare the models by using performance measures, such as goodness-of-fit or forecast error.
A Bayesian view of variable selection assigns a coefficient that is excluded from the model a degenerate prior distribution, that is, the Dirac delta distribution that has its probability mass concentrated on zero. To circumvent the complexity induced by degenerate variates, the prior for a coefficient that is excluded is instead a normal distribution with a small variance, for example, βj|σ² ~ N(0, 0.01σ²). Because the prior is concentrated around zero, the posterior must also be concentrated around zero. The prior for a coefficient that is included can be βj|σ² ~ N(μj, σ²Vj), where both μj and Vj are sufficiently far from zero.
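To see how the two prior components behave, the following NumPy sketch (an illustration, not part of the original example) evaluates the "spike" (excluded, variance 0.01) and "slab" (included, variance 100) densities at two points. The variance scales match the values used later in this example.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu)**2 / (2*var)) / np.sqrt(2*np.pi*var)

sigma2 = 1.0            # conditioning value of the disturbance variance
vin, vout = 100.0, 0.01 # "included" (diffuse) and "excluded" (spike) scales

for beta in [0.0, 2.0]:
    spike = normal_pdf(beta, 0.0, sigma2*vout)  # excluded-coefficient prior
    slab = normal_pdf(beta, 0.0, sigma2*vin)    # included-coefficient prior
    print(f"beta={beta}: spike={spike:.4g}, slab={slab:.4g}")
```

Near zero, the spike component dominates (so a coefficient that the data cannot distinguish from zero is shrunk hard toward zero); away from zero, the diffuse slab component dominates.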
Consider the latent, binary random variables γ0, γ1, and γ2 such that γj = 1 indicates that βj is included in the model and γj = 0 indicates that βj is excluded. The sample space of γ = (γ0, γ1, γ2) implies all eight permutations of models are included in the Bayesian variable selection scheme. Given the data, the value of γ that is most probable corresponds to the desired Bayesian linear regression model. To include γ in the model, you can marginalize it by summing over the probabilities of its values in its sample space. If P(γj = 1) = p_γj, then the joint prior probability distribution is this conjugate mixture distribution.
- π(β|σ², γ) = φ(β; μ, σ²Vγ).
- π(σ²) = IG(a, b).
φ(β; μ, σ²Vγ) is the PDF of the multivariate normal distribution with mean vector μ and diagonal covariance matrix σ²Vγ. μ and Vγ consist of the means and variances of the βj under γ: the jth diagonal element of Vγ is vin,j if γj = 1 and vout,j if γj = 0. Marginalizing over γ gives each βj a two-component normal mixture prior with weights p_γj and 1 − p_γj.
The posterior is analytically tractable. Each component is a normal-inverse-gamma mixture distribution.
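For a fixed value of γ, the prior is a standard normal-inverse-gamma conjugate prior, so the conditional posterior follows the usual conjugate updates. As a reference sketch (these standard formulas are not stated in the original example; X denotes the design matrix including a column of ones for the intercept, and T the number of observations):

```latex
\begin{aligned}
\beta \mid \sigma^2, \gamma, y &\sim N\!\left(\bar{\mu}, \sigma^2 \bar{V}\right),
& \bar{V} &= \left(V_\gamma^{-1} + X'X\right)^{-1},
\quad \bar{\mu} = \bar{V}\left(V_\gamma^{-1}\mu + X'y\right), \\
\sigma^2 \mid \gamma, y &\sim \mathrm{IG}\!\left(\bar{a}, \bar{b}\right),
& \bar{a} &= a + \frac{T}{2},
\quad \bar{b}^{-1} = b^{-1} + \tfrac{1}{2}\left(y'y + \mu'V_\gamma^{-1}\mu - \bar{\mu}'\bar{V}^{-1}\bar{\mu}\right).
\end{aligned}
```

Summing these conditional posteriors over the sample space of γ, weighted by the posterior model probabilities, yields the normal-inverse-gamma mixture posterior.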
Choose Hyperparameter Values
Prior hyperparameters include μj, vin,j, vout,j, a, b, and p_γj, where j = 0, 1, 2. Although you should conduct a sensitivity analysis, for now, you can arbitrarily choose values for the prior means, variance scales, shape, and scale parameters. However, when you choose priors for the probabilities of γj, consider their implications on the final model. For example, if your model needs an intercept, then set p_γ0 to a relatively high value.
Choose hyperparameter values so that the distribution of the regression coefficients is centered around zero and:
- For regression coefficients that are in the model, the distribution is diffuse (large variance).
- For regression coefficients that are not in the model, the distribution is concentrated (small variance).
Choose arbitrary values for the inverse gamma parameters a and b, and for μj, vin,j, vout,j, and p_γj, j = 0, 1, 2.
numCoeffs = 3;

mu = zeros(numCoeffs,1);       % Prior means of the coefficients
vin = 100*ones(numCoeffs,1);   % Variance scales for included coefficients
vout = 0.01*ones(numCoeffs,1); % Variance scales for excluded coefficients
a = 3;                         % Inverse gamma shape
b = 1/3;                       % Inverse gamma scale
pGamma = [0.8; 0.3; 0.1];      % Prior inclusion probabilities P(gamma_j = 1)
Simulate Data
Simulate 1000 observations from the regression model

y_t = 1 + 2·x_t1 + 0.1·x_t2 + ε_t,

where the predictor vector (x_t1, x_t2)′ is drawn from a multivariate normal distribution with mean vector (1, 2)′ and covariance matrix [2 1; 1 2], and ε_t ~ N(0, 25).
rng('default'); % For reproducibility
numObs = 1000;
Beta0True = 1;
Beta1True = 2;
Beta2True = 0.1;
muX = [1 2];
SigmaX = eye(2) + ones(2);
X = mvnrnd(muX,SigmaX,numObs);
disturbance = 5*randn(numObs,1);
y = Beta0True + X*[Beta1True; Beta2True] + disturbance;
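As a cross-check of the simulation design (a NumPy sketch, not part of the example): with the true values above, E[y] = 1 + 2·1 + 0.1·2 = 3.2 and Var(y) = β′Σxβ + 25 ≈ 33.42, which the sample moments should approximately reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
num_obs = 100_000  # large sample so the moments are tight

mu_x = np.array([1.0, 2.0])
sigma_x = np.eye(2) + np.ones((2, 2))  # [[2, 1], [1, 2]]
X = rng.multivariate_normal(mu_x, sigma_x, size=num_obs)

beta = np.array([2.0, 0.1])
y = 1.0 + X @ beta + 5.0*rng.standard_normal(num_obs)

print(y.mean())  # close to 3.2
print(y.var())   # close to beta' Sigma_x beta + 25 = 33.42
```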
Joint Prior Distribution
Declare a function for the log of the joint prior distribution of β and σ², marginalized over γ. Write the function so that:
- The first input argument is a vector representing the values of the random coefficients and disturbance variance, [β′ σ²]′.
- Other input arguments are values of the hyperparameters in any order.
- The first output argument is the log of the joint prior density.
- The second output argument is the gradient of the log of the joint prior density with respect to the coefficients and disturbance variance. The Hamiltonian Monte Carlo (HMC) sampler is the only sampler that uses the second output argument and it is optional. You can specify a NaN for any partial derivative that is difficult to compute. However, known analytical gradients improve the performance of the HMC sampler. MATLAB® resorts to computing the numerical partial derivative for any unknown partial derivatives.
function [logPrior,gradient] = logPDFBVS(params,mu,vin,vout,pGamma,a,b)
%logPDFBVS Log joint prior for Bayesian variable selection
%   logPDFBVS is the log of the joint prior density of a
%   normal-inverse-gamma mixture conjugate model for a Bayesian linear
%   regression model with numCoeffs coefficients. logPDFBVS passes
%   params(1:end-1), the coefficients, to the PDF of a mixture of normal
%   distributions with hyperparameters mu, vin, vout, and pGamma, and also
%   passes params(end), the disturbance variance, to an inverse gamma
%   density with shape a and scale b.
%
%   params: Parameter values at which the densities are evaluated, a
%           (numCoeffs + 1)-by-1 numeric vector. The first numCoeffs
%           elements correspond to the regression coefficients and the
%           last element corresponds to the disturbance variance.
%
%   mu: Multivariate normal component means, a numCoeffs-by-1 numeric
%       vector of prior means for the regression coefficients.
%
%   vin: Multivariate normal component scales, a numCoeffs-by-1 vector
%        of positive values for the regression coefficients that are
%        likely in the model.
%
%   vout: Multivariate normal component scales, a numCoeffs-by-1 vector
%         of positive values for the regression coefficients that are
%         likely not in the model.
%
%   pGamma: Prior probability that regression coefficients are in the
%           model, a numCoeffs-by-1 vector of positive values from 0
%           through 1.
%
%   a: Inverse gamma shape parameter, a positive numeric scalar.
%
%   b: Inverse gamma scale parameter, a positive scalar.

Beta = params(1:end-1);
sigma2 = params(end);

% Log of the inverse gamma prior density (up to an additive constant)
priorSigma2 = -(a+1)*log(sigma2) - 1/(b*sigma2);

% Log of the two-component normal mixture density for each coefficient
priorBeta0in = pGamma(1)*normpdf(Beta(1),mu(1),sqrt(sigma2*vin(1)));
priorBeta0out = (1 - pGamma(1))*normpdf(Beta(1),mu(1),sqrt(sigma2*vout(1)));
priorBeta0 = log(priorBeta0in + priorBeta0out);

priorBeta1in = pGamma(2)*normpdf(Beta(2),mu(2),sqrt(sigma2*vin(2)));
priorBeta1out = (1 - pGamma(2))*normpdf(Beta(2),mu(2),sqrt(sigma2*vout(2)));
priorBeta1 = log(priorBeta1in + priorBeta1out);

priorBeta2in = pGamma(3)*normpdf(Beta(3),mu(3),sqrt(sigma2*vin(3)));
priorBeta2out = (1 - pGamma(3))*normpdf(Beta(3),mu(3),sqrt(sigma2*vout(3)));
priorBeta2 = log(priorBeta2in + priorBeta2out);

logPrior = priorSigma2 + priorBeta0 + priorBeta1 + priorBeta2;

% Partial derivatives of the log prior with respect to the coefficients
gradPriorBeta0 = 1/exp(priorBeta0)*...
    (pGamma(1)*normpdf(Beta(1),mu(1),sqrt(sigma2*vin(1)))*...
    (-(Beta(1) - mu(1))/(sigma2*vin(1))) + ...
    ((1 - pGamma(1))*normpdf(Beta(1),mu(1),sqrt(sigma2*vout(1)))*...
    (-(Beta(1) - mu(1))/(sigma2*vout(1)))));

gradPriorBeta1 = 1/exp(priorBeta1)*...
    (pGamma(2)*normpdf(Beta(2),mu(2),sqrt(sigma2*vin(2)))*...
    (-(Beta(2) - mu(2))/(sigma2*vin(2))) + ...
    ((1 - pGamma(2))*normpdf(Beta(2),mu(2),sqrt(sigma2*vout(2)))*...
    (-(Beta(2) - mu(2))/(sigma2*vout(2)))));

gradPriorBeta2 = 1/exp(priorBeta2)*...
    (pGamma(3)*normpdf(Beta(3),mu(3),sqrt(sigma2*vin(3)))*...
    (-(Beta(3) - mu(3))/(sigma2*vin(3))) + ...
    ((1 - pGamma(3))*normpdf(Beta(3),mu(3),sqrt(sigma2*vout(3)))*...
    (-(Beta(3) - mu(3))/(sigma2*vout(3)))));

% Contributions of each coefficient's mixture to the sigma2 derivative
gradPriorSigmaBeta0 = 1/exp(priorBeta0)*(...
    pGamma(1)*normpdf(Beta(1),mu(1),sqrt(sigma2*vin(1)))*...
    (0.5*(Beta(1) - mu(1))^2/vin(1)/sigma2^2 - 0.5/sigma2) +...
    (1 - pGamma(1))*normpdf(Beta(1),mu(1),sqrt(sigma2*vout(1)))*...
    (0.5*(Beta(1) - mu(1))^2/vout(1)/sigma2^2 - 0.5/sigma2));

gradPriorSigmaBeta1 = 1/exp(priorBeta1)*(...
    pGamma(2)*normpdf(Beta(2),mu(2),sqrt(sigma2*vin(2)))*...
    (0.5*(Beta(2) - mu(2))^2/vin(2)/sigma2^2 - 0.5/sigma2) +...
    (1 - pGamma(2))*normpdf(Beta(2),mu(2),sqrt(sigma2*vout(2)))*...
    (0.5*(Beta(2) - mu(2))^2/vout(2)/sigma2^2 - 0.5/sigma2));

gradPriorSigmaBeta2 = 1/exp(priorBeta2)*(...
    pGamma(3)*normpdf(Beta(3),mu(3),sqrt(sigma2*vin(3)))*...
    (0.5*(Beta(3) - mu(3))^2/vin(3)/sigma2^2 - 0.5/sigma2) +...
    (1 - pGamma(3))*normpdf(Beta(3),mu(3),sqrt(sigma2*vout(3)))*...
    (0.5*(Beta(3) - mu(3))^2/vout(3)/sigma2^2 - 0.5/sigma2));

% Partial derivative of the log prior with respect to sigma2
gradPriorSigma2 = -(a+1)/sigma2 + 1/(b*sigma2^2) + gradPriorSigmaBeta0 + ...
    gradPriorSigmaBeta1 + gradPriorSigmaBeta2;

gradient = [gradPriorBeta0; gradPriorBeta1; gradPriorBeta2; gradPriorSigma2];
end
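Because hand-coded gradients are error prone, it is worth verifying them numerically. This NumPy sketch (an illustration, not part of the example) re-implements the log mixture prior for a single coefficient and compares the analytic derivative, which mirrors the gradPriorBeta* terms above, with a central finite difference:

```python
import numpy as np

def npdf(x, m, v):
    """Density of N(m, v) evaluated at x."""
    return np.exp(-(x - m)**2/(2*v)) / np.sqrt(2*np.pi*v)

def log_mix_prior(beta, sigma2, mu=0.0, vin=100.0, vout=0.01, p=0.8):
    """Log of the two-component normal mixture prior for one coefficient."""
    return np.log(p*npdf(beta, mu, sigma2*vin)
                  + (1 - p)*npdf(beta, mu, sigma2*vout))

def grad_beta(beta, sigma2, mu=0.0, vin=100.0, vout=0.01, p=0.8):
    """Analytic d/d(beta) of log_mix_prior (same form as gradPriorBeta*)."""
    num = (p*npdf(beta, mu, sigma2*vin)*(-(beta - mu)/(sigma2*vin))
           + (1 - p)*npdf(beta, mu, sigma2*vout)*(-(beta - mu)/(sigma2*vout)))
    return num / np.exp(log_mix_prior(beta, sigma2, mu, vin, vout, p))

beta, sigma2, h = 0.5, 2.0, 1e-6
fd = (log_mix_prior(beta + h, sigma2) - log_mix_prior(beta - h, sigma2)) / (2*h)
print(fd, grad_beta(beta, sigma2))  # the two values agree closely
```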
Create Bayesian Linear Regression Model
Create an anonymous function from the log of the joint prior distribution. The only input argument of the function is a vector containing the coefficients and disturbance variance.
logPDF = @(params)logPDFBVS(params,mu,vin,vout,pGamma,a,b);
Create a Bayesian linear regression model containing three coefficients, one of which is an intercept, using the custom joint prior distribution represented by logPDF.
Mdl = bayeslm(numCoeffs - 1,'ModelType','custom','LogPDF',logPDF);
Mdl is a customblm model object.
Specify Sampler Options
Bayesian linear regression models with custom priors treat the posterior as analytically intractable. Therefore, MCMC sampling is required to explore the posterior distribution. MATLAB® supports these samplers.
- Random walk Metropolis sampler with multivariate normal or t proposal distributions
- Slice sampler
- Hamiltonian Monte Carlo (HMC) sampler
Each sampler has a set of hyperparameters that you can tune.
To compare the performance of the three samplers, create a sampler options structure for each sampler.
- For the slice sampler, use the default typical width.
- For the random walk Metropolis sampler, specify a multivariate t proposal distribution with three degrees of freedom.
- For the HMC sampler, use the default options.
optionsS = sampleroptions('Sampler','slice');
optionsRWM = sampleroptions('Sampler','metropolis','Distribution','mvt',...
    'DegreeOfFreedom',3);
optionsHMC = sampleroptions('Sampler','hmc');
Estimate Posterior Distributions
Estimate the posterior distributions using each sampler. Specify 2000 draws, a burn-in sample of 1000, and, for numerical stability, reparameterization of the disturbance variance. Store the results in a cell array.
numDraws = 2000;
burnIn = 1000;
EstMdl = cell(3,1); % Preallocation
estBeta = zeros(numCoeffs,3);
options = {optionsS optionsRWM optionsHMC};
samplerNames = cellfun(@(x)x.Sampler,options,'UniformOutput',false);

for j = 1:3
    [EstMdl{j},estBeta(:,j)] = estimate(Mdl,X,y,'Options',options{j},...
        'NumDraws',numDraws,'BurnIn',burnIn,'Reparameterize',true);
end
Method: MCMC sampling with 2000 draws
Number of observations: 1000
Number of predictors:   3

           |   Mean     Std          CI95        Positive  Distribution
-----------------------------------------------------------------------
 Intercept |  0.6011  0.2106  [ 0.202,  1.005]    0.999     Empirical
 Beta(1)   |  1.9729  0.1251  [ 1.727,  2.198]    1.000     Empirical
 Beta(2)   |  0.3699  0.0972  [ 0.191,  0.570]    1.000     Empirical
 Sigma2    | 25.1563  1.1442  [22.892, 27.303]    1.000     Empirical

Method: MCMC sampling with 2000 draws
Number of observations: 1000
Number of predictors:   3

           |   Mean     Std          CI95        Positive  Distribution
-----------------------------------------------------------------------
 Intercept |  0.6408  0.2543  [ 0.143,  1.186]    0.995     Empirical
 Beta(1)   |  1.9977  0.1243  [ 1.729,  2.221]    1.000     Empirical
 Beta(2)   |  0.3464  0.1157  [ 0.132,  0.570]    0.999     Empirical
 Sigma2    | 25.2375  1.1399  [23.021, 27.788]    1.000     Empirical

Method: MCMC sampling with 2000 draws
Number of observations: 1000
Number of predictors:   3

           |   Mean     Std          CI95        Positive  Distribution
-----------------------------------------------------------------------
 Intercept |  0.6388  0.2580  [ 0.133,  1.150]    0.992     Empirical
 Beta(1)   |  1.9875  0.1358  [ 1.710,  2.251]    1.000     Empirical
 Beta(2)   |  0.3543  0.1266  [ 0.116,  0.612]    0.996     Empirical
 Sigma2    | 25.1477  1.0994  [23.046, 27.322]    1.000     Empirical
EstMdl is a 3-by-1 cell array, and each cell contains an estimated Bayesian linear regression model (empiricalblm model object) derived from the corresponding MCMC sampler. estBeta is a 3-by-3 matrix, and each row contains the posterior means of the three coefficients.
The estimation displays at the command line show that the estimates are close to each other and fairly close to their true values.
MCMC Sampler Diagnostics
For each sampler and parameter, view trace plots for the MCMC draws.
for k = 1:3
    figure;
    for j = 1:3
        subplot(3,1,j);
        plot(EstMdl{j}.BetaDraws(k,:));
        title(sprintf('%s -- %s',samplerNames{j},EstMdl{j}.VarNames{k}));
    end
end

figure;
for j = 1:3
    subplot(3,1,j);
    plot(EstMdl{j}.Sigma2Draws);
    title(sprintf('%s -- Sigma2',samplerNames{j}));
end
The trace plots indicate that the Metropolis and slice samples have high autocorrelation and are mixing slowly. However, the HMC sample is mixing well.
Compare Analytic Solution and Simulation Results
Given a particular value of γ, the Bayesian linear regression model is a normal-inverse-gamma conjugate model. Compute the analytic posterior distribution for each of the eight models. Turn off all estimation displays. Store the marginal likelihood value for each model.
scales = [0.01 100];
[g{numCoeffs:-1:1}] = ndgrid(1:numel(scales));
Idx = reshape(cat(numCoeffs+1,g{:}),[],numCoeffs);
V = scales(Idx); % Permuted scales
estBetaConjugate = zeros(size(V,1),numCoeffs);
logL = zeros(size(V,1),1);

for j = 1:size(V,1)
    MdlConjugate = bayeslm(numCoeffs - 1,'ModelType','conjugate',...
        'Mu',mu,'V',diag(V(j,:)),'A',a,'B',b);
    [~,estBetaConjugate(j,:)] = estimate(MdlConjugate,X,y,...
        'Display',false);
    logL(j) = marginY(MdlConjugate,X,y);
end
estBetaConjugate is an 8-by-3 numeric matrix. Rows correspond to estimated coefficients for models given particular values of γ. For example, estBetaConjugate(1,:) contains the estimated coefficients for the model in which all coefficients have a prior scale of 0.01.
Compute the posterior distribution marginalized over γ.
PIdx = abs(Idx - 2);
pGammaMat = PIdx + (-1).^PIdx.*pGamma'; % Permuted gamma probabilities

% Compute log posterior model probabilities (unnormalized)
postProb = sum(log(pGammaMat),2) + logL;

% Normalize
maxProb = max(postProb);
postProb = exp(postProb - maxProb)/sum(exp(postProb - maxProb));

% Average of betas weighted by model probabilities
estBetaConjugate = sum(estBetaConjugate.*postProb)'
estBetaConjugate =

    0.6321
    1.9910
    0.3538
estBetaConjugate contains the analytic estimates of the coefficients in the normal-inverse-gamma mixture model, marginalized over γ.
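The normalization step above subtracts the maximum log probability before exponentiating — the standard log-sum-exp trick, which avoids underflow when the marginal log likelihoods are large negative numbers. A NumPy sketch of the same computation with hypothetical values:

```python
import numpy as np

# Hypothetical unnormalized log posterior model probabilities
log_post = np.array([-1050.0, -1047.0, -1046.0, -1052.0])

# Naive exponentiation underflows to zero in double precision:
print(np.exp(log_post))  # all elements are 0.0

# Subtracting the max first keeps the computation stable:
shifted = np.exp(log_post - log_post.max())
post_prob = shifted / shifted.sum()
print(post_prob)  # valid probabilities that sum to 1
```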
Compare estimated coefficients for all methods.
est = [[Beta0True; Beta1True; Beta2True] estBetaConjugate estBeta];
array2table(est,'VariableNames',{'True' 'Analytic' samplerNames{:}},...
    'RowNames',EstMdl{1}.VarNames)
ans =

  3x5 table

                 True    Analytic     Slice     Metropolis      HMC
                 ____    ________    _______    __________    _______

    Intercept      1     0.63212     0.60109      0.64085     0.63875
    Beta(1)        2       1.991      1.9729       1.9977      1.9875
    Beta(2)      0.1     0.35379     0.36987      0.34643     0.35435
The analytic estimates are fairly close to those from MCMC sampling.
Estimate Model Indicator
The model indicator is the posterior probability of a particular model. This measure indicates whether a coefficient should be in the model. Estimate model indicators for all methods.
ProbCoeffIn = zeros(numCoeffs,4); % numCoeffs-by-numMethods

% Monte Carlo estimation for samplers
for i = 1:numCoeffs
    for j = 1:numel(samplerNames)
        pBetaIn = pGamma(i)*normpdf(EstMdl{j}.BetaDraws(i,:),mu(i),...
            sqrt(EstMdl{j}.Sigma2Draws.*vin(i)));
        pBetaOut = (1 - pGamma(i))*normpdf(EstMdl{j}.BetaDraws(i,:),mu(i),...
            sqrt(EstMdl{j}.Sigma2Draws.*vout(i)));
        gam = pBetaIn./(pBetaIn + pBetaOut) > rand(1,numDraws);
        ProbCoeffIn(i,j + 1) = mean(gam);
    end
end

% Analytical estimates
for j = 1:numCoeffs
    ProbCoeffIn(j,1) = sum(postProb.*(1 - PIdx(:,j)));
end
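The Monte Carlo loop above thresholds each draw's conditional inclusion probability against a uniform random number and averages the resulting indicators. A standard variance-reduction observation (not part of the original example) is that averaging the conditional probabilities directly estimates the same quantity with lower variance, by the law of total variance. A NumPy sketch with stand-in probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
num_draws = 200_000

# Hypothetical per-draw inclusion probabilities P(gamma = 1 | draw)
p_incl = rng.beta(2.0, 5.0, size=num_draws)  # stand-in values in (0, 1)

# Estimate as in the example: draw gamma indicators, then average
gamma = p_incl > rng.random(num_draws)
mc_estimate = gamma.mean()

# Rao-Blackwellized alternative: average the probabilities directly
rb_estimate = p_incl.mean()

print(mc_estimate, rb_estimate)  # both approximate E[P(gamma = 1)]
```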
Compare the model indicators among the methods.
array2table(ProbCoeffIn,'VariableNames',{'Analytic' samplerNames{:}},...
    'RowNames',EstMdl{1}.VarNames)
ans =

  3x4 table

                  Analytic     Slice     Metropolis     HMC
                 _________    ______    __________    ______

    Intercept      0.10725    0.0945       0.1035     0.1185
    Beta(1)        0.88582     0.885       0.8925      0.872
    Beta(2)      0.0014898    0.0015        0.002      0.001
The values are close to each other.
All methods suggest that Beta(2), the coefficient of x_t2, should be removed from the model because its posterior inclusion probability is below 0.01.
In this case, the slice, Metropolis, and HMC samplers yield equivalent results. However, the samples drawn by the HMC sampler are of much higher quality than those drawn by the other samplers. Drawing inferences from MCMC samples that do not mix well or have not reached their stationary distribution can lead to incorrect conclusions. You can increase the quality of the samples by tuning the sampler parameters, thinning the sample, increasing the number of draws, or reparameterizing the model.