Bayesian Linear Regression

Linear regression is a statistical tool used to:

  • Study the linear dependencies or influences of predictor or explanatory variables on response variables.

  • Predict or forecast future responses given future predictor data.

The multiple linear regression (MLR) model is

yt=xtβ+εt.

For times t = 1,...,T:

  • yt is the observed response.

  • xt is a 1-by-(p + 1) row vector of observed values of p predictors. To accommodate a model intercept, x1t = 1 for all t.

  • β is a (p + 1)-by-1 column vector of regression coefficients corresponding to the variables that compose the columns of xt.

  • εt is the random disturbance that have a mean of zero and Cov(ε) = Ω. In general, Ω is a T-by-T symmetric, positive definite matrix. For simplicity, assume the disturbances are uncorrelated and have common variance, that is, Ω = σ2IT×T.

The values of β represent the expected marginal contributions of the corresponding predictors to yt. When the predictor xj increases by one unit, y is expected to increase by βj units, assuming all other variables are held fixed. εt is the random difference between the true and expected response at time t.

Classical Versus Bayesian Analyses

To study the linear influences of the predictors on the response, or to build a predictive MLR, you must first estimate the parameters β and σ2. Frequentist statisticians use the classical approach to estimation, that is, they treat the parameters as fixed but unknown quantities. Popular frequentist estimation tools include least squares and maximum likelihood. If the disturbances are independent, homoscedastic, and Gaussian or normal (Statistics and Machine Learning Toolbox), then least squares and maximum likelihood yield equivalent estimates. Inferences, such as confidence intervals on the parameter estimates or prediction intervals, are based on the distribution of the disturbances. For more on the frequentist approach to MLR analysis, see Time Series Regression I: Linear Models or [6], Ch. 3. Most tools in Econometrics Toolbox™ are frequentist.

A Bayesian approach to estimation and inference of MLR models treats β and σ2 as random variables rather than fixed, unknown quantities. In general, the goal of a Bayesian analysis is to update the probability distributions of the parameters by incorporating information about the parameters from observing the data. Prior to sampling the data, you have some beliefs about the joint distribution of the parameters. After sampling, you combine the likelihood induced by the distribution of the data with your prior beliefs to compose a joint conditional distribution of the parameters given the data. Features and functions of the resulting distribution are the basis for estimation and inference.

Main Bayesian Analysis Components

One of the main goals of a Bayesian analysis is to compute, or sample from, the posterior distribution (or posterior). The posterior is the distribution of the parameters updated using (or given) the data, and is composed of these quantities:

  • A likelihood function — The information that the sample provides about the parameters. If you take random sample, then the likelihood for MLR is

    (β,σ2|y,x)=t=1TP(yt|xt,β,σ2).

    P(yt|xt,β,σ2) is the conditional probability density function of yt given the parameters and induced by the conditional distribution of εt. Usually, xt is a fixed quantity. If the disturbances are independent, homoscedastic, and Gaussian, then

    (β,σ2|y,x)=t=1Tϕ(yt;xtβ,σ2).

    ϕ(yt;xtβ,σ2) is the Gaussian probability density with mean xtβ and variance σ2, evaluated at yt.

  • Prior distributions (or priors) on the parameters — The distribution of the parameters that you assume before observing the data. Imposing prior distribution assumptions on parameters has an advantage over frequentist analyses: priors allow you to incorporate knowledge about the model before viewing the data. You can control the confidence in your knowledge about the parameter by adjusting the prior variance. Specifying a high variance implies that you know very little about the parameter, and you want to weigh the information in the data about the parameters more heavily. Specifying a low variance implies high confidence in your knowledge about the parameter, and you want to account for that knowledge in the analysis.

    In practice, you use priors for convenience rather than to follow the opinion of a researcher about the actual distribution of the parameters. For example, you can choose priors so that the corresponding posterior distribution is in the same family of distributions. These prior-posterior pairs are called conjugate distributions. However, the choice of priors can influence estimation and inference, so you should perform a sensitivity analysis with estimation.

    Priors can contain parameters, called hyperparameters, that can have probability distributions themselves. Such models are called hierarchical Bayesian models.

    For MLR, prior distributions are typically denoted as π(β) and π(σ2). A popular choice is the normal-inverse-gamma conjugate model, in which π(β|σ2) is the multivariate Gaussian or multivariate normal (Statistics and Machine Learning Toolbox) distribution and π(σ2) is the inverse gamma distribution.

You can contain the joint posterior distribution of β and σ2 using Bayes’s Rule, that is,

π(β,σ2|y,x)=π(β)π(σ2)(β,σ2|y,x)β,σ2π(β)π(σ2)(β,σ2|y,x)dβdσ2π(β)π(σ2)(β,σ2|y,x).

If β depends on σ2, then its prior should be replaced with π(β|σ2). The denominator is the distribution of the response given the predictors, and it becomes a constant after you observe y. Therefore, the posterior is often written as being proportional to the numerator.

The posterior is like any other joint probability distribution of random variables, and it contains all of the information known about the parameters after you incorporate the data. Parameter estimates and inferences are based mainly on integrals of functions of the parameters with respect to the posterior distribution.

Posterior Estimation and Inference

Posterior estimation and inference involve integrating functions of parameters with respect to the posterior. Popular estimators and inferences for MLR parameters include the following:

  • The expected value of β given the data is

    β^=E(β|y,x)=β,σ2βπ(β,σ2|y,x)dβdσ2.

    This quantity provides a natural interpretation and is the minimum mean squared error (MSE) estimator, that is, it minimizes E[(β^β)2|y,x]. The median, mode, or a quantile can be Bayes estimators, with respect to other losses.

  • The maximum a priori estimate (MAP) — The value of the parameter that maximizes the posterior distribution.

  • Given the data, the predicted response y^ of the predictor x^ is a random variable with the posterior predictive distribution

    π(y^|y,x,x^)=β,σ2f(y^|β,σ,x^)π(β,σ2|y,x)dβdσ2.

    You can view this quantity as the conditional expected value of the probability distribution of y with respect to the posterior distribution of the parameters.

  • A 95% confidence interval on β (or credible interval) — set S such that P(βS|y,x) = 0.95. This equation yields infinitely many intervals, including the:

    • Equitailed interval, which is the interval (L,U) such that P(β < L|y,x) = 0.025 and P(β > U|y,x) = 0.025.

    • Highest posterior density (HPD) region, which is the narrowest interval (or intervals) yielding the specified probability. It necessarily contains the greatest posterior values.

    Unlike the interpretation of frequentist confidence intervals, the interpretation of Bayesian confidence intervals is that given the data, the probability that a random β is in the interval(s) S is 0.95. This interpretation is intuitive, which is an advantage of Bayesian confidence intervals over frequentist confidence intervals.

  • Marginal posterior probabilities of variable inclusion, also called regime probabilities, result from implementing stochastic search variable selection (SSVS) and indicate whether predictor variables are insignificant or redundant in a Bayesian linear regression model. In SSVS, β has a multivariate, two-component Gaussian mixture distribution. Both components have a mean of zero, but one component has a large variance and the other component has a small variance. Insignificant predictors are likely to be close to zero; therefore, they are from the component with the small variance. SSVS samples from the space of 2p + 1 permutations of a model, each permutation includes or excludes a coefficient, and models with the highest posterior density are sampled more often. Regime probabilities are derived from the sampled models.

Integration methods depend on the functional form of the product π(β)π(σ2)(β,σ2|y,x) and the integrand, for example, h(β,σ2).

  • If the product forms the kernel of a known probability distribution, then integrals of h(β,σ2) with respect to the posterior can be analytically tractable. Known kernels often arise when you choose priors and posteriors to form conjugate pairs. In these cases, the first several moments of the distribution are typically known, and estimates are based off them. For details on the analytically tractable posterior distributions offered by the Bayesian linear regression model framework in Econometrics Toolbox, see Analytically Tractable Posteriors.

  • Otherwise, you must use numerical integration techniques to compute integrals of h(β,σ2) with respect to posterior distributions. Under certain conditions, you can implement numerical integration using Monte Carlo or Markov chain Monte Carlo (MCMC) sampling.

    • To perform Monte Carlo estimation, you draw many samples from a probability distribution, apply an appropriate function to each draw (h(β,σ2) is a factor in the function), and average the resulting draws to approximate the integral. A popular Monte Carlo technique is sampling importance resampling [6].

    • You implement MCMC when you do not know the probability distribution up to a constant, or you know the conditional distributions of all parameters at least up to a constant. Popular MCMC techniques include Gibbs sampling [2], the Metropolis-Hastings algorithm [5], and slice sampling [9].

    For details on posterior estimation of a Bayesian linear regression model in Econometrics Toolbox when the posterior is intractable, see Analytically Intractable Posteriors.

Analytically Tractable Posteriors

The Bayesian linear regression framework in Econometrics Toolbox offers several prior model specifications that yield analytically tractable, conjugate marginal or conditional posteriors. This table identifies the prior models and their corresponding posteriors. When you pass a prior model and data to estimate, MATLAB® uses these formulae. When the software constructs posteriors, it assumes that the response data yt, t = 1,...,T, is a random sample from a Gaussian distribution with mean xtβ and variance σ2.

Prior Model ObjectPriorsMarginal PosteriorsConditional Posteriors
conjugateblm

β|σ2~Np+1(μ,σ2V).σ2~IG(A,B).

β and σ2 are independent.

β|y,x~tp+1((V1+XX)1[(XX)β^+V1μ],2B1+(yXβ^)(yXβ^)+(β^μ)[V+(XX)1]1(β^μ)2A+T,2A+T).σ2|y,x~IG(A+T2,[B1+12(yXβ^)(yXβ^)+12(β^μ)[V+(XX)1]1(β^μ)]1).

β|σ2,y,x~Np+1((V1+XX)1[(XX)β^+V1μ],σ2(V1+XX)1).σ2|β,y,x~IG(A+T+p+12,[B1+12(yXβ)(yXβ)+12(βμ)V1(βμ)]1).
semiconjugateblm

β|σ2~Np+1(μ,V).σ2~IG(A,B).

β and σ2 are dependent.

Analytically intractable

β|σ2,y,x~Np+1((V1+σ2XX)1[σ2(XX)β^+V1μ],(V1+XX)1).σ2|β,y,x~IG(A+T2,[B1+12(yXβ)(yXβ)]1).

diffuseblm

The joint prior pdf is

fβ,σ2(β,σ2)1σ2.

β|y,x~tp+1(β^,(yXβ^)(yXβ^)Tp1(XX)1,Tp1).σ2|y,x~IG(Tp12,[12(yXβ^)(yXβ^)]1).

β|σ2,y,x~Np+1(β^,σ2(XX)1).σ2|β,y,x~IG(T2,[12(yXβ)(yXβ)]1).

mixconjugateblm

γ={γ1,...,γp+1}~p(γ).j,γj{0,1}.j,βj|σ2,γj=γjσVj1Z1+(1γj)σVj2Z2.Zk~N(0,1);k=1,2.σ2~IG(A,B).

Although the marginal posteriors are analytically tractable, MATLAB treats them as intractable for scalability (see [1]).

Analytically tractable if γj and γk are independent, for all jk

γj|β,γj,σ2,X,y~Bernoulli(ajaj+bj);j=1,...,p+1.j,aj=P(γj=1)ϕ(0,σ2Vj1).j,bj=P(γj=0)ϕ(0,σ2Vj2).β|σ2,γ,X,y~Np+1((V1+XX)1XY,σ2(V1+XX)1).σ2|β,γ,X,y~IG(A+T+p+12,[B1+12(yXβ)(yXβ)+12βV1β]1).

mixsemiconjugateblm

γ={γ1,...,γp+1}~p(γ).j,γj{0,1}.j,βj|σ2,γj=γjVj1Z1+(1γj)Vj2Z2.Zk~N(0,1);k=1,2.σ2~IG(A,B).

Analytically intractable

Analytically tractable if γj and γk are independent, for all jk

γj|β,γj,σ2,X,y~Bernoulli(ajaj+bj);j=1,...,p+1.j,aj=P(γj=1)ϕ(0,Vj1).j,bj=P(γj=0)ϕ(0,Vj2).β|σ2,γ,X,y~Np+1((V1+σ2XX)1XY,(V1+σ2XX)1).σ2|β,γ,X,y~IG(A+T2,[B1+12(yXβ)(yXβ)]1).

lassoblm

βj|σ2,λ~Laplace(0,σ/λ);j=0,..,p.σ2~IG(A,B).

Coefficients are independent, a priori.

Analytically intractable

1ψj|βj,σ2,λ~InvGaussian(σλ/|βj|,λ2);j=1,...,p+1.D=diag(ψ1,...,ψp+1).β|σ2,λ,X,y,ψ~Np+1((XX+D)1Xy,σ2(XX+D)1).σ2|β,X,y,ψ~IG(A+T+p+12,[B1+12(yXβ)(yXβ)+12βDβ]1).

In the table:

  • Np+1(m,Σ) denotes the (p + 1)-dimensional multivariate normal distribution, where m is the mean (a (p + 1)-by-1 vector) and Σ is the variance (a (p + 1)-by-(p + 1) symmetric, positive definite matrix).

  • IG(A,B) denotes the inverse gamma distribution with shape A > 0 and scale B > 0. The pdf of an IG(A,B) is

    f(x;A,B)=1Γ(A)BAxA1e1xB.

  • X is a T-by-(p + 1) matrix of predictor data, that is, xjk is observation j of predictor k. The first column is composed entirely of ones for the intercept.

  • y is a T-by-1 vector of responses.

  • tp+1(m,Σ,ν) denotes the (p + 1)-dimensional multivariate t distribution, where m is the location, Σ is the scale, and ν is the degrees of freedom.

  • β^=(XX)1Xy, that is, the least-squares estimate of β.

  • V*j1 is the prior variance factor (mixconjugate) or variance (mixsemiconjugate) of βj when γj = 1, and V*j2 is its prior variance factor or variance when γj = 0.

  • V* is a (p + 1)-by-(p + 1) diagonal matrix, and element j,j is γjV*j1 + (1 – γj)V*j2.

  • mixconjugateblm and mixsemiconjugateblm models support prior mean specifications for β other than the default zero vector for both components of the Gaussian mixture model. If you change the default prior mean β, then the corresponding conditional posterior distributions include the prior means in the same way that the conditional posterior distributions of conjugateblm and semiconjugateblm models include the prior means.

  • λ is the fixed lasso shrinkage parameter.

  • InvGaussian(m,v) denotes the inverse Gaussian (Wald) with mean m and shape v.

Analytically Intractable Posteriors

The Bayesian linear regression framework in Econometrics Toolbox offers several prior model specifications that yield analytically intractable, but flexible, marginal and conditional posteriors. This table identifies the prior models and the Monte Carlo sampling techniques that MATLAB uses to perform posterior estimation, simulation, and inference when you pass a prior model and data to estimate, simulate, or forecast.

Prior Model ObjectPriorsSimulation Technique for Marginal PosteriorSimulation Technique for Conditional Posterior
semiconjugateblm

β|σ2~Np+1(μ,V).σ2~IG(A,B).

β and σ2 are dependent.

Gibbs sampler [2]Conditional posterior is analytically tractable
empiricalblmCharacterized by draws from the respective prior distributionsSampling importance resampling [4]Not supported
customblmCharacterized by the joint pdf. in a declared function
  • Hamiltonian Monte Carlo sampler [8]

  • Random walk Metropolis sampler [7]

  • Slice sampler [9]

  • Hamiltonian Monte Carlo sampler

  • Random walk Metropolis sampler

  • Slice sampler

mixconjugateblm

γ={γ1,...,γp+1}~p(γ).j,γj{0,1}.j,βj|σ2,γj=γjσVj1Z1+(1γj)σVj2Z2.Zk~N(0,1);k=1,2.σ2~IG(A,B).

Gibbs sampler [1]Conditional posterior is analytically tractable
mixsemiconjugateblm

γ={γ1,...,γp+1}~p(γ).j,γj{0,1}.j,βj|σ2,γj=γjVj1Z1+(1γj)Vj2Z2.Zk~N(0,1);k=1,2.σ2~IG(A,B).

Gibbs sampler [1]Conditional posterior is analytically tractable
lassoblm

βj|σ2,λ~Laplace(0,σ/λ);j=0,..,p.σ2~IG(A,B).

Coefficients are independent, a priori.

Gibbs sampler [10] Conditional posterior is analytically tractable

References

[1] George, E. I., and R. E. McCulloch. "Variable Selection Via Gibbs Sampling." Journal of the American Statistical Association. Vol. 88, No. 423, 1993, pp. 881–889.

[2] Gelfand, A. E., and A. F. M. Smith. “Sampling-Based Approaches to Calculating Marginal Densities.” Journal of the American Statistical Association. Vol. 85, 1990, pp. 398–409.

[3] Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, 2nd. Ed. Boca Raton, FL: Chapman & Hall/CRC, 2004.

[4] Gordon, N. J., D. J. Salmond, and A. F. M. Smith. "Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation." IEEE Proceedings F on Radar and Signal Processing. Vol. 140, 1993, pp. 107–113.

[5] Hastings, W. K. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika. Vol. 57, 1970, pp. 97–109.

[6] Marin, J. M., and C. P. Robert. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. New York: Springer Science+Business Media, LLC, 2007.

[7] Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. "Equations of State Calculations by Fast Computing Machine." J. Chem. Phys. Vol. 21, 1953, pp. 1087–1091.

[8] Neal, R. M. "MCMC using Hamiltonian dynamics." In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng (eds.) Handbook of Markov Chain Monte Carlo. Boca Raton, FL: Chapman & Hall/CRC, 2011.

[9] Neal, R. M. “Slice Sampling.” The Annals of Statistics. Vol. 31, 2003, pp. 705–767.

[10] Park, T., and G. Casella. "The Bayesian Lasso." Journal of the American Statistical Association. Vol. 103, No. 482, 2008, pp. 681–686.

See Also

| | | | | | |

Related Topics