Bayesian Linear Regression

Linear regression is a statistical tool used to:

Study the linear dependencies or influences of predictor or explanatory variables on response variables.
Predict or forecast future responses given future predictor data.

The multiple linear regression (MLR) model is

$y_{t} = x_{t} β + ε_{t} .$

For times t = 1,...,T:

y_t is the observed response.
x_t is a 1-by-(p + 1) row vector of observed values of p predictors. To accommodate a model intercept, x_1t = 1 for all t.
β is a (p + 1)-by-1 column vector of regression coefficients corresponding to the variables that compose the columns of x_t.
ε_t is the random disturbance, which has a mean of zero and Cov(ε) = Ω. In general, Ω is a T-by-T symmetric, positive definite matrix. For simplicity, assume the disturbances are uncorrelated and have common variance, that is, Ω = σ²I_T×T.

The values of β represent the expected marginal contributions of the corresponding predictors to y_t. When the predictor x_j increases by one unit, y is expected to increase by β_j units, assuming all other variables are held fixed. ε_t is the random difference between the true and expected response at time t.

Classical Versus Bayesian Analyses

To study the linear influences of the predictors on the response, or to build a predictive MLR, you must first estimate the parameters β and σ². Frequentist statisticians use the classical approach to estimation, that is, they treat the parameters as fixed but unknown quantities. Popular frequentist estimation tools include least squares and maximum likelihood. If the disturbances are independent, homoscedastic, and Gaussian or normal, then least squares and maximum likelihood yield equivalent estimates. Inferences, such as confidence intervals on the parameter estimates or prediction intervals, are based on the distribution of the disturbances. For more on the frequentist approach to MLR analysis, see Time Series Regression I: Linear Models or [6], Ch. 3. Most tools in Econometrics Toolbox™ are frequentist.
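The equivalence of least squares and maximum likelihood under Gaussian disturbances can be checked numerically. The following is a minimal sketch, not toolbox code: the function names `ols_simple` and `gaussian_loglik`, the simulated data, and the true parameter values are all assumptions made for illustration, and the model is restricted to an intercept and one predictor.

```python
import math
import random


def ols_simple(x, y):
    """Least-squares intercept and slope for y_t = b0 + b1 * x_t + e_t."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def gaussian_loglik(b0, b1, sigma2, x, y):
    """Log-likelihood under independent N(0, sigma2) disturbances."""
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * len(x) * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)


# Simulated data; the true values (1.0, 2.0, 1.5) are illustrative assumptions.
rng = random.Random(0)
T = 200
x = [rng.uniform(0, 10) for _ in range(T)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1.5) for xi in x]

b0_hat, b1_hat = ols_simple(x, y)
# The ML estimate of sigma^2 divides the residual sum of squares by T.
sigma2_ml = sum((yi - b0_hat - b1_hat * xi) ** 2 for xi, yi in zip(x, y)) / T

# Moving the coefficients away from the least-squares values lowers the likelihood.
ll_at_ols = gaussian_loglik(b0_hat, b1_hat, sigma2_ml, x, y)
ll_perturbed = gaussian_loglik(b0_hat + 0.1, b1_hat - 0.05, sigma2_ml, x, y)
print(b0_hat, b1_hat, ll_at_ols > ll_perturbed)
```

For a fixed σ², the Gaussian log-likelihood is a decreasing function of the residual sum of squares, so the coefficients that minimize the sum of squares also maximize the likelihood.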

A Bayesian approach to estimation and inference of MLR models treats β and σ² as random variables rather than fixed, unknown quantities. In general, the goal of a Bayesian analysis is to update the probability distributions of the parameters by incorporating information about the parameters from observing the data. Prior to sampling the data, you have some beliefs about the joint distribution of the parameters. After sampling, you combine the likelihood induced by the distribution of the data with your prior beliefs to compose a joint conditional distribution of the parameters given the data. Features and functions of the resulting distribution are the basis for estimation and inference.

Main Bayesian Analysis Components

One of the main goals of a Bayesian analysis is to compute, or sample from, the posterior distribution (or posterior). The posterior is the distribution of the parameters updated using (or given) the data, and is composed of these quantities:

A likelihood function: the information that the sample provides about the parameters. If you take a random sample, then the likelihood for MLR is

$ℓ (β, σ^{2} | y, x) = \prod_{t = 1}^{T} P (y_{t} | x_{t}, β, σ^{2}) .$
$P (y_{t} | x_{t}, β, σ^{2})$ is the conditional probability density function of y_t given the parameters and induced by the conditional distribution of ε_t. Usually, x_t is a fixed quantity. If the disturbances are independent, homoscedastic, and Gaussian, then

$ℓ (β, σ^{2} | y, x) = \prod_{t = 1}^{T} ϕ (y_{t}; x_{t} β, σ^{2}) .$
ϕ(y_t;x_tβ,σ²) is the Gaussian probability density with mean x_tβ and variance σ², evaluated at y_t.
Prior distributions (or priors) on the parameters — The distribution of the parameters that you assume before observing the data. Imposing prior distribution assumptions on parameters has an advantage over frequentist analyses: priors allow you to incorporate knowledge about the model before viewing the data. You can control the confidence in your knowledge about the parameter by adjusting the prior variance. Specifying a high variance implies that you know very little about the parameter, and you want to weigh the information in the data about the parameters more heavily. Specifying a low variance implies high confidence in your knowledge about the parameter, and you want to account for that knowledge in the analysis.
In practice, you use priors for convenience rather than to follow the opinion of a researcher about the actual distribution of the parameters. For example, you can choose priors so that the corresponding posterior distribution is in the same family of distributions. These prior-posterior pairs are called conjugate distributions. However, the choice of priors can influence estimation and inference, so you should perform a sensitivity analysis with estimation.
Priors can contain parameters, called hyperparameters, that can have probability distributions themselves. Such models are called hierarchical Bayesian models.
For MLR, prior distributions are typically denoted as π(β) and π(σ²). A popular choice is the normal-inverse-gamma conjugate model, in which π(β|σ²) is the multivariate Gaussian or multivariate normal distribution and π(σ²) is the inverse gamma distribution.

You can obtain the joint posterior distribution of β and σ² using Bayes' rule, that is,

$π (β, σ^{2} | y, x) = \frac{π (β) π (σ^{2}) ℓ (β, σ^{2} | y, x)}{\int_{β, σ^{2}} π (β) π (σ^{2}) ℓ (β, σ^{2} | y, x) d β d σ^{2}} \propto π (β) π (σ^{2}) ℓ (β, σ^{2} | y, x) .$

If β depends on σ², then its prior should be replaced with π(β|σ²). The denominator is the distribution of the response given the predictors, and it becomes a constant after you observe y. Therefore, the posterior is often written as being proportional to the numerator.
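To make the roles of the numerator and denominator concrete, the sketch below evaluates prior times likelihood on a grid and normalizes it numerically. It rests on simplifying assumptions that are not part of the model above: a single coefficient with no intercept, a known σ², a hypothetical N(mu0, v0) prior, and simulated data. Under those assumptions the posterior mean has a closed form, which serves as a check.

```python
import math
import random

# Simulated data for y_t = beta * x_t + e_t with known sigma^2 (an assumption).
rng = random.Random(1)
sigma2 = 1.0
x = [rng.uniform(0, 2) for _ in range(50)]
y = [1.5 * xi + rng.gauss(0, math.sqrt(sigma2)) for xi in x]

mu0, v0 = 0.0, 4.0  # hypothetical prior: beta ~ N(mu0, v0)


def log_prior(beta):
    """Log prior density of beta, up to an additive constant."""
    return -0.5 * (beta - mu0) ** 2 / v0


def log_lik(beta):
    """Gaussian log-likelihood of beta, up to an additive constant."""
    return -0.5 * sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y)) / sigma2


# Evaluate the numerator of Bayes' rule on a grid (constants dropped).
step = 0.001
grid = [-5.0 + step * i for i in range(10001)]
log_num = [log_prior(b) + log_lik(b) for b in grid]
shift = max(log_num)  # subtract the maximum to avoid underflow
num = [math.exp(v - shift) for v in log_num]

# The denominator is the integral of the numerator over beta.
denom = sum(num) * step
posterior = [v / denom for v in num]
grid_mean = sum(b * p for b, p in zip(grid, posterior)) * step

# Closed-form posterior mean for this normal prior with known sigma^2.
s_xx = sum(xi * xi for xi in x)
s_xy = sum(xi * yi for xi, yi in zip(x, y))
exact_mean = (mu0 / v0 + s_xy / sigma2) / (1.0 / v0 + s_xx / sigma2)
print(grid_mean, exact_mean)
```

Grid normalization is practical only in very low dimensions, which is one reason the integration methods discussed later matter.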

The posterior is like any other joint probability distribution of random variables, and it contains all of the information known about the parameters after you incorporate the data. Parameter estimates and inferences are based mainly on integrals of functions of the parameters with respect to the posterior distribution.

Posterior Estimation and Inference

Posterior estimation and inference involve integrating functions of parameters with respect to the posterior. Popular estimators and inferences for MLR parameters include the following:

The expected value of β given the data is

$\hat{β} = E (β | y, x) = \int_{β, σ^{2}} β π (β, σ^{2} | y, x) d β d σ^{2} .$
This quantity provides a natural interpretation and is the minimum mean squared error (MSE) estimator, that is, it minimizes $E [{(\hat{β} - β)}^{2} | y, x] .$ The median, mode, or a quantile can be Bayes estimators, with respect to other losses.
The maximum a posteriori (MAP) estimate: the value of the parameter that maximizes the posterior distribution.
Given the data, the predicted response $\hat{y}$ of the predictor $\hat{x}$ is a random variable with the posterior predictive distribution

$π (\hat{y} | y, x, \hat{x}) = \int_{β, σ^{2}} f (\hat{y} | β, σ^{2}, \hat{x}) π (β, σ^{2} | y, x) d β d σ^{2} .$
You can view this quantity as the conditional expected value of the probability distribution of y with respect to the posterior distribution of the parameters.
A 95% confidence interval on β (or credible interval) — set S such that P(β ∊ S|y,x) = 0.95. This equation yields infinitely many intervals, including the:
- Equitailed interval, which is the interval (L,U) such that P(β < L|y,x) = 0.025 and P(β > U|y,x) = 0.025.
- Highest posterior density (HPD) region, which is the narrowest interval (or intervals) yielding the specified probability. It necessarily contains the greatest posterior values.
Unlike the interpretation of frequentist confidence intervals, the interpretation of Bayesian confidence intervals is that given the data, the probability that a random β is in the interval(s) S is 0.95. This interpretation is intuitive, which is an advantage of Bayesian confidence intervals over frequentist confidence intervals.
Marginal posterior probabilities of variable inclusion, also called regime probabilities, result from implementing stochastic search variable selection (SSVS) and indicate whether predictor variables are insignificant or redundant in a Bayesian linear regression model. In SSVS, β has a multivariate, two-component Gaussian mixture distribution. Both components have a mean of zero, but one component has a large variance and the other component has a small variance. Insignificant predictors are likely to have coefficients close to zero; therefore, those coefficients are from the component with the small variance. SSVS samples from the space of $2^{p + 1}$ permutations of a model, where each permutation includes or excludes a coefficient, and it samples models with the highest posterior density more often. Regime probabilities are derived from the sampled models.
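When the posterior is available only as a set of draws, the point estimates and intervals above reduce to operations on the sorted draws. The sketch below illustrates this under stated assumptions: the draws come from a right-skewed gamma distribution that merely stands in for the posterior of one coefficient, and the HPD computation assumes a unimodal posterior so that the region is a single interval.

```python
import math
import random

# Stand-in for posterior draws of one coefficient: a right-skewed gamma sample.
rng = random.Random(2)
draws = sorted(rng.gammavariate(2.0, 1.0) for _ in range(20000))
n = len(draws)

posterior_mean = sum(draws) / n      # minimum-MSE point estimate
posterior_median = draws[n // 2]     # approximate Bayes estimator under absolute loss

k = math.ceil(0.95 * n)              # number of draws each 95% interval must hold

# Equitailed interval: leave the same number of draws in each tail.
tail = (n - k) // 2
eq_lo, eq_hi = draws[tail], draws[tail + k - 1]

# HPD interval (unimodal case): the narrowest window that holds k sorted draws.
start = min(range(n - k + 1), key=lambda i: draws[i + k - 1] - draws[i])
hpd_lo, hpd_hi = draws[start], draws[start + k - 1]

print(posterior_mean, posterior_median)
print((eq_lo, eq_hi), (hpd_lo, hpd_hi))
```

Because the equitailed interval is one of the candidate windows, the HPD interval found this way is never wider than it. For a skewed posterior such as this one, the two intervals can differ noticeably.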

Integration methods depend on the functional form of the product $π (β) π (σ^{2}) ℓ (β, σ^{2} | y, x)$ and the integrand, for example, h(β,σ²).

If the product forms the kernel of a known probability distribution, then integrals of h(β,σ²) with respect to the posterior can be analytically tractable. Known kernels often arise when you choose priors and posteriors to form conjugate pairs. In these cases, the first several moments of the distribution are typically known, and estimates are based on them. For details on the analytically tractable posterior distributions offered by the Bayesian linear regression model framework in Econometrics Toolbox, see Analytically Tractable Posteriors.
Otherwise, you must use numerical integration techniques to compute integrals of h(β,σ²) with respect to posterior distributions. Under certain conditions, you can implement numerical integration using Monte Carlo or Markov chain Monte Carlo (MCMC) sampling.
- To perform Monte Carlo estimation, you draw many samples from a probability distribution, apply an appropriate function to each draw (h(β,σ²) is a factor in the function), and average the resulting draws to approximate the integral. A popular Monte Carlo technique is sampling importance resampling [6].
- You implement MCMC when you know the probability distribution only up to a constant, or when you know the conditional distributions of all parameters at least up to a constant. Popular MCMC techniques include Gibbs sampling [2], the Metropolis-Hastings algorithm [5], and slice sampling [9].
For details on posterior estimation of a Bayesian linear regression model in Econometrics Toolbox when the posterior is intractable, see Analytically Intractable Posteriors.
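As an illustration of sampling from a posterior that is known only up to a constant, here is a minimal random walk Metropolis sketch followed by Monte Carlo averages of the draws. It is not the toolbox sampler. The single-coefficient model with known σ² = 1, the N(mu0, v0) prior, the proposal step size, and the function names `log_kernel` and `metropolis` are assumptions chosen so that the exact posterior mean is available for comparison.

```python
import math
import random

rng = random.Random(3)

# Simulated data for y_t = beta * x_t + e_t with known sigma^2 = 1 (an assumption).
x = [rng.uniform(0, 2) for _ in range(50)]
y = [1.5 * xi + rng.gauss(0, 1.0) for xi in x]
mu0, v0 = 0.0, 4.0  # hypothetical prior: beta ~ N(mu0, v0)


def log_kernel(beta):
    """Log of prior times likelihood, known only up to an additive constant."""
    sse = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    return -0.5 * (beta - mu0) ** 2 / v0 - 0.5 * sse


def metropolis(n_draws, burn_in, step):
    """Random walk Metropolis draws of beta after discarding burn_in iterations."""
    beta = 0.0
    current = log_kernel(beta)
    draws = []
    for i in range(n_draws + burn_in):
        proposal = beta + rng.gauss(0, step)
        candidate = log_kernel(proposal)
        # Accept with probability min(1, kernel ratio); the constant cancels.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        if i >= burn_in:
            draws.append(beta)
    return draws


draws = metropolis(n_draws=20000, burn_in=2000, step=0.3)

# Monte Carlo estimates: average h(beta) over the draws.
mc_mean = sum(draws) / len(draws)                       # h(beta) = beta
mc_prob = sum(b > 1.5 for b in draws) / len(draws)      # h(beta) = 1{beta > 1.5}

# Exact posterior mean for this special case, used only as a check.
s_xx = sum(xi * xi for xi in x)
s_xy = sum(xi * yi for xi, yi in zip(x, y))
exact_mean = (mu0 / v0 + s_xy) / (1.0 / v0 + s_xx)
print(mc_mean, exact_mean, mc_prob)
```

The acceptance step uses only the ratio of kernel values, which is why the normalizing constant in the denominator of Bayes' rule never has to be computed.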

Analytically Tractable Posteriors

The Bayesian linear regression framework in Econometrics Toolbox offers several prior model specifications that yield analytically tractable, conjugate marginal or conditional posteriors. This table identifies the prior models and their corresponding posteriors. When you pass a prior model and data to estimate, MATLAB® uses these formulae. When the software constructs posteriors, it assumes that the response data y_t, t = 1,...,T, is a random sample from a Gaussian distribution with mean x_tβ and variance σ².

Prior Model Object	Priors	Marginal Posteriors	Conditional Posteriors
`conjugateblm`	$\begin{array}{l} β \mid σ^{2} \sim N_{p + 1} (μ, σ^{2} V) . \\ σ^{2} \sim I G (A, B) . \end{array}$ β and σ² are dependent.	$\begin{array}{l} β \mid y, x \sim t_{p + 1} ({(V^{- 1} + X' X)}^{- 1} [(X' X) \hat{β} + V^{- 1} μ], \frac{2 B^{- 1} + {(y - X \hat{β})}^{'} (y - X \hat{β}) + {(\hat{β} - μ)}^{'} {[V + {(X' X)}^{- 1}]}^{- 1} (\hat{β} - μ)}{2 A + T} {(V^{- 1} + X' X)}^{- 1}, 2 A + T) . \\ σ^{2} \mid y, x \sim I G (A + \frac{T}{2}, {[B^{- 1} + \frac{1}{2} {(y - X \hat{β})}^{'} (y - X \hat{β}) + \frac{1}{2} {(\hat{β} - μ)}^{'} {[V + {(X' X)}^{- 1}]}^{- 1} (\hat{β} - μ)]}^{- 1}) . \end{array}$	$\begin{array}{l} β \mid σ^{2}, y, x \sim N_{p + 1} ({(V^{- 1} + X' X)}^{- 1} [(X' X) \hat{β} + V^{- 1} μ], σ^{2} {(V^{- 1} + X' X)}^{- 1}) . \\ σ^{2} \mid β, y, x \sim I G (A + \frac{T + p + 1}{2}, {[B^{- 1} + \frac{1}{2} {(y - X β)}^{'} (y - X β) + \frac{1}{2} {(β - μ)}^{'} V^{- 1} (β - μ)]}^{- 1}) . \end{array}$
`semiconjugateblm`	$\begin{array}{l} β \mid σ^{2} \sim N_{p + 1} (μ, V) . \\ σ^{2} \sim I G (A, B) . \end{array}$ β and σ² are independent.	Analytically intractable	$\begin{array}{l} β \mid σ^{2}, y, x \sim N_{p + 1} ({(V^{- 1} + σ^{- 2} X' X)}^{- 1} [σ^{- 2} (X' X) \hat{β} + V^{- 1} μ], {(V^{- 1} + σ^{- 2} X' X)}^{- 1}) . \\ σ^{2} \mid β, y, x \sim I G (A + \frac{T}{2}, {[B^{- 1} + \frac{1}{2} {(y - X β)}^{'} (y - X β)]}^{- 1}) . \end{array}$
`diffuseblm`	The joint prior pdf is $f_{β, σ^{2}} (β, σ^{2}) \propto \frac{1}{σ^{2}} .$	$\begin{array}{l} β \mid y, x \sim t_{p + 1} (\hat{β}, \frac{{(y - X \hat{β})}^{'} (y - X \hat{β})}{T - p - 1} {(X^{'} X)}^{- 1}, T - p - 1) . \\ σ^{2} \mid y, x \sim I G (\frac{T - p - 1}{2}, {[\frac{1}{2} {(y - X \hat{β})}^{'} (y - X \hat{β})]}^{- 1}) . \end{array}$	$\begin{array}{l} β \mid σ^{2}, y, x \sim N_{p + 1} (\hat{β}, σ^{2} {(X^{'} X)}^{- 1}) . \\ σ^{2} \mid β, y, x \sim I G (\frac{T}{2}, {[\frac{1}{2} {(y - X β)}^{'} (y - X β)]}^{- 1}) . \end{array}$
`mixconjugateblm`	$\begin{array}{l} γ = \{γ_{1}, ..., γ_{p + 1}\} \sim p (γ) . \\ \forall j, γ_{j} \in \{0, 1\} . \\ \forall j, β_{j} \mid σ^{2}, γ_{j} = γ_{j} σ \sqrt{V_{j 1}} Z_{1} + (1 - γ_{j}) σ \sqrt{V_{j 2}} Z_{2} . \\ Z_{k} \sim N (0, 1); k = 1, 2. \\ σ^{2} \sim I G (A, B) . \end{array}$	Although the marginal posteriors are analytically tractable, MATLAB treats them as intractable for scalability (see [1]).	Analytically tractable if γ_j and γ_k are independent, for all j ≠ k: $\begin{array}{l} γ_{j} \mid β, γ_{\neq j}, σ^{2}, X, y \sim Bernoulli (\frac{a_{j}}{a_{j} + b_{j}}); j = 1, ..., p + 1. \\ \forall j, a_{j} = P (γ_{j} = 1) ϕ (β_{j}; 0, σ^{2} V_{j 1}^{*}) . \\ \forall j, b_{j} = P (γ_{j} = 0) ϕ (β_{j}; 0, σ^{2} V_{j 2}^{*}) . \\ β \mid σ^{2}, γ, X, y \sim N_{p + 1} ({({V^{*}}^{- 1} + X' X)}^{- 1} X^{'} y, σ^{2} {({V^{*}}^{- 1} + X' X)}^{- 1}) . \\ σ^{2} \mid β, γ, X, y \sim I G (A + \frac{T + p + 1}{2}, {[B^{- 1} + \frac{1}{2} {(y - X β)}^{'} (y - X β) + \frac{1}{2} β^{'} {V^{*}}^{- 1} β]}^{- 1}) . \end{array}$
`mixsemiconjugateblm`	$\begin{array}{l} γ = \{γ_{1}, ..., γ_{p + 1}\} \sim p (γ) . \\ \forall j, γ_{j} \in \{0, 1\} . \\ \forall j, β_{j} \mid σ^{2}, γ_{j} = γ_{j} \sqrt{V_{j 1}} Z_{1} + (1 - γ_{j}) \sqrt{V_{j 2}} Z_{2} . \\ Z_{k} \sim N (0, 1); k = 1, 2. \\ σ^{2} \sim I G (A, B) . \end{array}$	Analytically intractable	Analytically tractable if γ_j and γ_k are independent, for all j ≠ k: $\begin{array}{l} γ_{j} \mid β, γ_{\neq j}, σ^{2}, X, y \sim Bernoulli (\frac{a_{j}}{a_{j} + b_{j}}); j = 1, ..., p + 1. \\ \forall j, a_{j} = P (γ_{j} = 1) ϕ (β_{j}; 0, V_{j 1}^{*}) . \\ \forall j, b_{j} = P (γ_{j} = 0) ϕ (β_{j}; 0, V_{j 2}^{*}) . \\ β \mid σ^{2}, γ, X, y \sim N_{p + 1} ({({V^{*}}^{- 1} + σ^{- 2} X' X)}^{- 1} σ^{- 2} X^{'} y, {({V^{*}}^{- 1} + σ^{- 2} X' X)}^{- 1}) . \\ σ^{2} \mid β, γ, X, y \sim I G (A + \frac{T}{2}, {[B^{- 1} + \frac{1}{2} {(y - X β)}^{'} (y - X β)]}^{- 1}) . \end{array}$
`lassoblm`	$\begin{array}{l} β_{j} \mid σ^{2}, λ \sim Laplace (0, σ / λ); j = 0, ..., p . \\ σ^{2} \sim I G (A, B) . \end{array}$ Coefficients are independent, a priori.	Analytically intractable	$\begin{array}{l} \frac{1}{ψ_{j}} \mid β_{j}, σ^{2}, λ \sim InvGaussian (σ λ / | β_{j} |, λ^{2}); j = 1, ..., p + 1. \\ D = diag (ψ_{1}, ..., ψ_{p + 1}) . \\ β \mid σ^{2}, λ, X, y, ψ \sim N_{p + 1} ({(X^{'} X + D)}^{- 1} X^{'} y, σ^{2} {(X^{'} X + D)}^{- 1}) . \\ σ^{2} \mid β, X, y, ψ \sim I G (A + \frac{T + p + 1}{2}, {[B^{- 1} + \frac{1}{2} {(y - X β)}^{'} (y - X β) + \frac{1}{2} β^{'} D β]}^{- 1}) . \end{array}$

In the table:

N_{p+1}(m,Σ) denotes the (p + 1)-dimensional multivariate normal distribution, where m is the mean (a (p + 1)-by-1 vector) and Σ is the variance (a (p + 1)-by-(p + 1) symmetric, positive definite matrix).
IG(A,B) denotes the inverse gamma distribution with shape A > 0 and scale B > 0. The pdf of an IG(A,B) is

$f (x; A, B) = \frac{1}{Γ (A) B^{A}} x^{- A - 1} e^{- \frac{1}{x B}} .$
X is a T-by-(p + 1) matrix of predictor data, that is, x_jk is observation j of predictor k. The first column is composed entirely of ones for the intercept.
y is a T-by-1 vector of responses.
t_{p+1}(m,Σ,ν) denotes the (p + 1)-dimensional multivariate t distribution, where m is the location, Σ is the scale, and ν is the degrees of freedom.
$\hat{β} = {(X^{'} X)}^{- 1} X^{'} y$ , that is, the least-squares estimate of β.
V^*_j1 is the prior variance factor (mixconjugate) or variance (mixsemiconjugate) of β_j when γ_j = 1, and V^*_j2 is its prior variance factor or variance when γ_j = 0.
V^* is a (p + 1)-by-(p + 1) diagonal matrix, and element j,j is γ_jV^*_j1 + (1 – γ_j)V^*_j2.
mixconjugateblm and mixsemiconjugateblm models support prior mean specifications for β other than the default zero vector for both components of the Gaussian mixture model. If you change the default prior mean β, then the corresponding conditional posterior distributions include the prior means in the same way that the conditional posterior distributions of conjugateblm and semiconjugateblm models include the prior means.
λ is the fixed lasso shrinkage parameter.
InvGaussian(m,v) denotes the inverse Gaussian (Wald) with mean m and shape v.
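To show how the conjugate formulas are used, the sketch below evaluates them for the simplest possible case, an intercept-only model in which X is a column of ones, so X'X = T and the least-squares estimate is the sample mean. The data are simulated, the hyperparameter values are hypothetical, and the variable names are illustrative rather than toolbox properties. The t scale includes the factor (V⁻¹ + X'X)⁻¹, which the marginal posterior needs to agree with the conditional posterior of β.

```python
import random

# Intercept-only model y_t = beta + e_t: X is a column of ones,
# so X'X = T and the least-squares estimate is the sample mean.
rng = random.Random(4)
T = 100
y = [5.0 + rng.gauss(0, 2.0) for _ in range(T)]
beta_hat = sum(y) / T
sse = sum((yi - beta_hat) ** 2 for yi in y)

# Hypothetical hyperparameters: beta | sigma2 ~ N(mu, sigma2 * V), sigma2 ~ IG(A, B).
mu, V, A, B = 0.0, 10.0, 3.0, 1.0

# Conditional posterior of beta given sigma2: N(post_mean, sigma2 * post_factor).
post_factor = 1.0 / (1.0 / V + T)
post_mean = post_factor * (T * beta_hat + mu / V)

# Marginal posterior of sigma2: IG(post_shape, post_scale).
post_shape = A + T / 2
quad = (beta_hat - mu) ** 2 / (V + 1.0 / T)
post_scale = 1.0 / (1.0 / B + 0.5 * sse + 0.5 * quad)

# With the pdf given above, an IG(a, b) variable has mean 1 / (b * (a - 1)) for a > 1.
sigma2_mean = 1.0 / (post_scale * (post_shape - 1))

# Marginal posterior of beta: t with location post_mean, scale t_scale, and dof.
dof = 2 * A + T
t_scale = (2.0 / B + sse + quad) / dof * post_factor
beta_var = t_scale * dof / (dof - 2)  # variance of a t distribution with dof > 2

print(post_mean, sigma2_mean, beta_var)
```

The posterior mean of β is a precision-weighted average of the prior mean and the least-squares estimate, so it always lies between the two. The marginal variance of β equals the posterior mean of σ² times the same factor that scales the conditional variance.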

Analytically Intractable Posteriors

The Bayesian linear regression framework in Econometrics Toolbox offers several prior model specifications that yield analytically intractable, but flexible, marginal and conditional posteriors. This table identifies the prior models and the Monte Carlo sampling techniques that MATLAB uses to perform posterior estimation, simulation, and inference when you pass a prior model and data to estimate, simulate, or forecast.

Prior Model Object	Priors	Simulation Technique for Marginal Posterior	Simulation Technique for Conditional Posterior
`semiconjugateblm`	$\begin{array}{l} β \mid σ^{2} \sim N_{p + 1} (μ, V) . \\ σ^{2} \sim I G (A, B) . \end{array}$ β and σ² are independent.	Gibbs sampler [2]	Conditional posterior is analytically tractable
`empiricalblm`	Characterized by draws from the respective prior distributions	Sampling importance resampling [4]	Not supported
`customblm`	Characterized by the joint pdf in a declared function	Hamiltonian Monte Carlo sampler [8]; random walk Metropolis sampler [7]; slice sampler [9]	Hamiltonian Monte Carlo sampler; random walk Metropolis sampler; slice sampler
`mixconjugateblm`	$\begin{array}{l} γ = \{γ_{1}, ..., γ_{p + 1}\} \sim p (γ) . \\ \forall j, γ_{j} \in \{0, 1\} . \\ \forall j, β_{j} \mid σ^{2}, γ_{j} = γ_{j} σ \sqrt{V_{j 1}} Z_{1} + (1 - γ_{j}) σ \sqrt{V_{j 2}} Z_{2} . \\ Z_{k} \sim N (0, 1); k = 1, 2. \\ σ^{2} \sim I G (A, B) . \end{array}$	Gibbs sampler [1]	Conditional posterior is analytically tractable
`mixsemiconjugateblm`	$\begin{array}{l} γ = \{γ_{1}, ..., γ_{p + 1}\} \sim p (γ) . \\ \forall j, γ_{j} \in \{0, 1\} . \\ \forall j, β_{j} \mid σ^{2}, γ_{j} = γ_{j} \sqrt{V_{j 1}} Z_{1} + (1 - γ_{j}) \sqrt{V_{j 2}} Z_{2} . \\ Z_{k} \sim N (0, 1); k = 1, 2. \\ σ^{2} \sim I G (A, B) . \end{array}$	Gibbs sampler [1]	Conditional posterior is analytically tractable
`lassoblm`	$\begin{array}{l} β_{j} \mid σ^{2}, λ \sim Laplace (0, σ / λ); j = 0, ..., p . \\ σ^{2} \sim I G (A, B) . \end{array}$ Coefficients are independent, a priori.	Gibbs sampler [10]	Conditional posterior is analytically tractable
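A Gibbs sampler for the semiconjugate model alternates between the two tractable conditional posteriors. The sketch below is an illustration under stated assumptions, not the toolbox implementation: it uses a single predictor with no intercept so that all quantities are scalars, simulated data, hypothetical hyperparameter values, and an illustrative function name `gibbs`. It relies on the fact that if σ² is IG(a, b) with the pdf used on this page, then 1/σ² has a gamma distribution with shape a and scale b.

```python
import math
import random

rng = random.Random(5)

# Simulated data for y_t = beta * x_t + e_t; the true values are illustrative.
T = 200
x = [rng.uniform(0, 10) for _ in range(T)]
y = [2.0 * xi + rng.gauss(0, 1.5) for xi in x]
s_xx = sum(xi * xi for xi in x)
s_xy = sum(xi * yi for xi, yi in zip(x, y))

# Hypothetical hyperparameters: beta ~ N(mu, V), sigma2 ~ IG(A, B).
mu, V, A, B = 0.0, 100.0, 3.0, 1.0


def gibbs(n_draws, burn_in):
    """Alternate draws from beta | sigma2, y, x and sigma2 | beta, y, x."""
    sigma2 = 1.0
    beta_draws, sigma2_draws = [], []
    for i in range(n_draws + burn_in):
        # beta | sigma2, y, x is Gaussian.
        var_beta = 1.0 / (1.0 / V + s_xx / sigma2)
        mean_beta = var_beta * (s_xy / sigma2 + mu / V)
        beta = rng.gauss(mean_beta, math.sqrt(var_beta))

        # sigma2 | beta, y, x is inverse gamma; draw 1/sigma2 from a gamma.
        sse = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
        shape = A + T / 2
        scale = 1.0 / (1.0 / B + 0.5 * sse)
        sigma2 = 1.0 / rng.gammavariate(shape, scale)

        if i >= burn_in:
            beta_draws.append(beta)
            sigma2_draws.append(sigma2)
    return beta_draws, sigma2_draws


beta_draws, sigma2_draws = gibbs(n_draws=5000, burn_in=500)
beta_mean = sum(beta_draws) / len(beta_draws)
sigma2_mean = sum(sigma2_draws) / len(sigma2_draws)
print(beta_mean, sigma2_mean)
```

Averages of the retained draws approximate the marginal posterior means, even though the marginal posteriors themselves have no closed form for this prior.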

References

[1] George, E. I., and R. E. McCulloch. "Variable Selection Via Gibbs Sampling." Journal of the American Statistical Association. Vol. 88, No. 423, 1993, pp. 881–889.

[2] Gelfand, A. E., and A. F. M. Smith. "Sampling-Based Approaches to Calculating Marginal Densities." Journal of the American Statistical Association. Vol. 85, 1990, pp. 398–409.

[3] Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, 2nd. Ed. Boca Raton, FL: Chapman & Hall/CRC, 2004.

[4] Gordon, N. J., D. J. Salmond, and A. F. M. Smith. "Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation." IEEE Proceedings F on Radar and Signal Processing. Vol. 140, 1993, pp. 107–113.

[5] Hastings, W. K. "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika. Vol. 57, 1970, pp. 97–109.

[6] Marin, J. M., and C. P. Robert. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. New York: Springer Science+Business Media, LLC, 2007.

[7] Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. "Equation of State Calculations by Fast Computing Machines." J. Chem. Phys. Vol. 21, 1953, pp. 1087–1091.

[8] Neal, R. M. "MCMC using Hamiltonian dynamics." In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng (eds.) Handbook of Markov Chain Monte Carlo. Boca Raton, FL: Chapman & Hall/CRC, 2011.

[9] Neal, R. M. "Slice Sampling." The Annals of Statistics. Vol. 31, 2003, pp. 705–767.

[10] Park, T., and G. Casella. "The Bayesian Lasso." Journal of the American Statistical Association. Vol. 103, No. 482, 2008, pp. 681–686.