Multivariate Normal Regression

Introduction

This section focuses on using likelihood-based methods for multivariate normal regression. The parameters of the regression model are estimated via maximum likelihood estimation. For multiple series, this requires iteration until convergence. The complication due to the possibility of missing data is incorporated into the analysis with a variant of the EM algorithm known as the ECM algorithm.

The underlying theory of maximum likelihood estimation and the definition and significance of the Fisher information matrix can be found in Caines [1] and Cramér [2]. The underlying theory of the ECM algorithm can be found in Meng and Rubin [8] and Sexton and Swensen [9].

In addition, these two examples of maximum likelihood estimation are presented:

Multivariate Normal Linear Regression

Suppose that you have a multivariate normal linear regression model in the form

$\begin{bmatrix} Z_{1} \\ \vdots \\ Z_{m} \end{bmatrix} \sim N \left( \begin{bmatrix} H_{1} b \\ \vdots \\ H_{m} b \end{bmatrix}, \begin{bmatrix} C & & 0 \\ & \ddots & \\ 0 & & C \end{bmatrix} \right),$

where the model has m observations of n-dimensional random variables Z₁, ..., Z_m with a linear regression model that has a p-dimensional model parameter vector b. In addition, the model has a sequence of m design matrices H₁, ..., H_m, where each design matrix is a known n-by-p matrix.

Given a parameter vector b and a collection of design matrices, the collection of m independent variables Z_k is assumed to have independent identically distributed multivariate normal residual errors Z_k – H_k b with n-vector mean 0 and n-by-n covariance matrix C for each k = 1, ..., m.

A concise way to write this model is

$Z_{k} \sim N (H_{k} b, C)$

for k = 1, ..., m.
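
To make the notation concrete, the following Python sketch simulates data from this model with NumPy. The array layout (observations in an m-by-n array `z`, design matrices in an m-by-n-by-p array `H`), the function name `simulate_mvn_regression`, and the sample values are assumptions chosen for illustration; they are not part of any toolbox interface.

```python
import numpy as np

def simulate_mvn_regression(H, b, C, seed=0):
    """Draw z_k ~ N(H_k b, C) for each design matrix H_k.

    H: (m, n, p) design matrices, b: (p,) parameters, C: (n, n) covariance.
    Returns an (m, n) array whose k-th row is the observation z_k.
    """
    rng = np.random.default_rng(seed)
    m, n, _ = H.shape
    mean = H @ b                                              # (m, n): row k is H_k b
    noise = rng.multivariate_normal(np.zeros(n), C, size=m)   # i.i.d. residuals
    return mean + noise

# Hypothetical example: m = 500 observations, n = 2 series, p = 3 parameters.
H = np.random.default_rng(1).normal(size=(500, 2, 3))
b = np.array([1.0, -2.0, 0.5])
C = np.array([[1.0, 0.5], [0.5, 2.0]])
z = simulate_mvn_regression(H, b, C)
print(z.shape)  # (500, 2)
```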

The goal of multivariate normal regression is to obtain maximum likelihood estimates for b and C given a collection of m observations z₁, ..., z_m of the random variables Z₁, ..., Z_m. The estimated parameters are the p distinct elements of b and the n (n + 1)/2 distinct elements of C (the lower-triangular elements of C).
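
The parameter count can be checked with a short Python sketch that stacks b and the lower-triangular elements of C into one vector. The helper name `pack_parameters` and the ordering of the elements are assumptions made for illustration only.

```python
import numpy as np

def pack_parameters(b, C):
    """Stack b and the lower-triangular elements of C into one vector."""
    rows, cols = np.tril_indices(C.shape[0])   # index pairs with row >= col
    return np.concatenate([b, C[rows, cols]])

b = np.array([1.0, -2.0, 0.5])                 # p = 3
C = np.array([[1.0, 0.5], [0.5, 2.0]])         # n = 2
theta = pack_parameters(b, C)
print(theta.size)  # 6, which equals p + n (n + 1)/2 = 3 + 3
```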

Note

Quasi-maximum likelihood estimation works with the same models but with a relaxation of the assumption of normally distributed residuals. In this case, however, the parameter estimates are asymptotically optimal.

Maximum Likelihood Estimation

To estimate the parameters of the multivariate normal linear regression model using maximum likelihood estimation, it is necessary to maximize the log-likelihood function over the estimation parameters given observations z₁, ... , z_m.

Given the multivariate normal model to characterize residual errors in the regression model, the log-likelihood function is

$\begin{matrix} L (z_{1}, \dots, z_{m}; b, C) = - \frac{1}{2} m n \log (2 π) - \frac{1}{2} m \log (\det (C)) \\ - \frac{1}{2} \sum_{k = 1}^{m} {(z_{k} - H_{k} b)}^{T} C^{- 1} (z_{k} - H_{k} b) . \end{matrix}$
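
As a minimal sketch, the log-likelihood can be evaluated with NumPy as follows. The function returns the quantity to be maximized, that is, −1/2 times the sum of the constant, log-determinant, and quadratic terms. The function name `mvn_regression_loglik` and the array layout (`z` is m-by-n, `H` is m-by-n-by-p) are assumptions for illustration.

```python
import numpy as np

def mvn_regression_loglik(z, H, b, C):
    """Log-likelihood of z_k ~ N(H_k b, C), k = 1..m.

    z: (m, n) observations, H: (m, n, p) design matrices,
    b: (p,) parameters, C: (n, n) covariance matrix.
    """
    m, n = z.shape
    resid = z - H @ b                        # (m, n): row k is z_k - H_k b
    sign, logdet = np.linalg.slogdet(C)      # numerically stable log(det(C))
    if sign <= 0:
        raise ValueError("C must be positive definite")
    # sum_k r_k^T C^{-1} r_k, computed with a linear solve instead of inv(C)
    quad = np.sum(resid * np.linalg.solve(C, resid.T).T)
    return -0.5 * (m * n * np.log(2 * np.pi) + m * logdet + quad)
```

A larger return value indicates a better fit, so candidate pairs (b, C) can be compared directly on the same data.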

Although the cross-sectional residuals must be independent, you can use this log-likelihood function for quasi-maximum likelihood estimation. In this case, the estimates for the parameters b and C provide estimates to characterize the first and second moments of the residuals. See Caines [1] for details.

Except for a special case (see Special Case of Multiple Linear Regression Model), if both the model parameters in b and the covariance parameters in C are to be estimated, the estimation problem is intractably nonlinear and a solution must use iterative methods. Denote estimates for the parameters b and C for iteration t = 0, 1, ... with the superscript notation b⁽ᵗ⁾ and C⁽ᵗ⁾.

Given initial estimates b⁽⁰⁾ and C⁽⁰⁾ for the parameters, the maximum likelihood estimates for b and C are obtained using a two-stage iterative process with

$b^{(t + 1)} = {(\sum_{k = 1}^{m} H_{k}^{T} {(C^{(t)})}^{- 1} H_{k})}^{- 1} (\sum_{k = 1}^{m} H_{k}^{T} {(C^{(t)})}^{- 1} z_{k})$

and

$C^{(t + 1)} = \frac{1}{m} \sum_{k = 1}^{m} (z_{k} - H_{k} b^{(t + 1)}) {(z_{k} - H_{k} b^{(t + 1)})}^{T}$

for t = 0, 1, ... .
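
The two-stage iteration can be sketched in Python with NumPy as shown here. This is an illustrative sketch under stated assumptions, not the toolbox implementation: the function name `mvn_regression_mle`, the starting values b⁽⁰⁾ = 0 and C⁽⁰⁾ = I, the stopping rule on parameter changes, and the default tolerance are all choices made for the example. It also assumes complete data and that every C⁽ᵗ⁾ is invertible.

```python
import numpy as np

def mvn_regression_mle(z, H, max_iter=100, tol=1e-10):
    """Iterate the b-update and C-update until the estimates stop changing.

    z: (m, n) observations, H: (m, n, p) design matrices.
    Returns the final estimates (b, C).
    """
    m, n = z.shape
    p = H.shape[2]
    b = np.zeros(p)                  # assumed start b^(0) = 0
    C = np.eye(n)                    # assumed start C^(0) = I
    for _ in range(max_iter):
        Cinv = np.linalg.inv(C)
        # Stage 1: b^(t+1) = (sum H_k^T C^-1 H_k)^-1 (sum H_k^T C^-1 z_k)
        A = np.einsum('kni,nl,klj->ij', H, Cinv, H)
        rhs = np.einsum('kni,nl,kl->i', H, Cinv, z)
        b_new = np.linalg.solve(A, rhs)
        # Stage 2: C^(t+1) = (1/m) sum r_k r_k^T with r_k = z_k - H_k b^(t+1)
        resid = z - H @ b_new
        C_new = resid.T @ resid / m
        converged = (np.linalg.norm(b_new - b) < tol
                     and np.linalg.norm(C_new - C) < tol)
        b, C = b_new, C_new
        if converged:
            break
    return b, C
```

Each pass uses the current covariance estimate to weight the regression, then recomputes the covariance from the new residuals.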

Special Case of Multiple Linear Regression Model

The special case mentioned in Maximum Likelihood Estimation occurs if n = 1 so that the sequence of observations is a sequence of scalar observations. This model is known as a multiple linear regression model. In this case, the covariance matrix C is a 1-by-1 matrix that drops out of the maximum likelihood iterates so that a single-step estimate for b and C can be obtained with converged estimates b⁽¹⁾ and C⁽¹⁾.
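
To see why C drops out, write the scalar variance at iteration t as $C^{(t)} = σ_{t}^{2} > 0$ (notation introduced here only for this illustration). The scalar factor cancels between the two sums in the update for b:

$b^{(t + 1)} = {(\sum_{k = 1}^{m} H_{k}^{T} σ_{t}^{- 2} H_{k})}^{- 1} (\sum_{k = 1}^{m} H_{k}^{T} σ_{t}^{- 2} z_{k}) = {(\sum_{k = 1}^{m} H_{k}^{T} H_{k})}^{- 1} (\sum_{k = 1}^{m} H_{k}^{T} z_{k}) .$

The right side does not depend on t, so b⁽¹⁾ is the same for every positive starting value C⁽⁰⁾, and C⁽¹⁾ follows directly from the residuals computed with b⁽¹⁾.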

Least-Squares Regression

Another simplification of the general model is called least-squares regression. If b⁽⁰⁾ = 0 and C⁽⁰⁾ = I, then b⁽¹⁾ and C⁽¹⁾ from the two-stage iterative process are least-squares estimates for b and C, where

$b^{L S} = {(\sum_{k = 1}^{m} H_{k}^{T} H_{k})}^{- 1} (\sum_{k = 1}^{m} H_{k}^{T} z_{k})$

and

$C^{L S} = \frac{1}{m} \sum_{k = 1}^{m} (z_{k} - H_{k} b^{L S}) {(z_{k} - H_{k} b^{L S})}^{T} .$
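
A minimal NumPy sketch of these two formulas follows. The function name `least_squares_estimates` and the array layout (`z` is m-by-n, `H` is m-by-n-by-p) are assumptions for illustration.

```python
import numpy as np

def least_squares_estimates(z, H):
    """Least-squares estimates of b and C.

    z: (m, n) observations, H: (m, n, p) design matrices.
    """
    m = z.shape[0]
    A = np.einsum('kni,knj->ij', H, H)    # sum_k H_k^T H_k
    rhs = np.einsum('kni,kn->i', H, z)    # sum_k H_k^T z_k
    b_ls = np.linalg.solve(A, rhs)
    resid = z - H @ b_ls                  # (m, n) residuals
    C_ls = resid.T @ resid / m            # averaged outer products
    return b_ls, C_ls
```

The estimate for b matches an ordinary least-squares fit on the system obtained by stacking all m design matrices and observations.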

Mean and Covariance Estimation

A final simplification of the general model is to estimate the mean and covariance of a sequence of n-dimensional observations z₁, ..., z_m. In this case, the number of series is equal to the number of model parameters with n = p and the design matrices are identity matrices with H_k = I for k = 1, ..., m so that b is an estimate for the mean and C is an estimate of the covariance of the collection of observations z₁, ..., z_m.
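
With H_k = I, the update for b reduces to the sample mean for any invertible C⁽ᵗ⁾, and the update for C becomes the sample covariance with divisor m. A short NumPy sketch, with the function name `mean_and_covariance` assumed for illustration:

```python
import numpy as np

def mean_and_covariance(z):
    """Estimates of mean and covariance for an (m, n) data array z."""
    m = z.shape[0]
    b = z.mean(axis=0)             # b-update with H_k = I is the sample mean
    resid = z - b
    C = resid.T @ resid / m        # divides by m, not m - 1
    return b, C
```

Because the divisor is m, this estimate corresponds to `np.cov(z, rowvar=False, bias=True)` rather than to the default unbiased form of `np.cov`.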

Convergence

If the iterative process continues until the log-likelihood function increases by no more than a specified amount, the resultant estimates are said to be maximum likelihood estimates b^ML and C^ML.

If n = 1 (which implies a single data series), convergence occurs after only one iterative step, which, in turn, implies that the least-squares and maximum likelihood estimates are identical. If, however, n > 1, the least-squares and maximum likelihood estimates are usually distinct.

In Financial Toolbox™ software, both the changes in the log-likelihood function and the norm of the change in parameter estimates are monitored. Whenever both changes fall below specified tolerances (which should be something between machine precision and its square root), the toolbox functions terminate under an assumption that convergence has been achieved.
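
A stopping rule of this kind can be sketched as follows. The function name `has_converged`, the use of absolute changes, and the default tolerances (machine precision raised to the powers 3/4 and 1/2, both between machine precision and its square root) are assumptions for illustration and are not taken from the toolbox source.

```python
import numpy as np

EPS = np.finfo(float).eps          # machine precision, about 2.2e-16

def has_converged(loglik_old, loglik_new, theta_old, theta_new,
                  tol_obj=EPS ** 0.75, tol_param=EPS ** 0.5):
    """Return True when both monitored changes fall below their tolerances."""
    # Change in the log-likelihood between consecutive iterations
    obj_change = abs(loglik_new - loglik_old)
    # Norm of the change in the combined parameter vector
    param_change = np.linalg.norm(np.asarray(theta_new) - np.asarray(theta_old))
    return bool(obj_change < tol_obj and param_change < tol_param)
```

Requiring both conditions guards against stopping on a flat stretch of the log-likelihood while the parameters are still moving.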

Fisher Information

Since maximum likelihood estimates are formed from samples of random variables, their estimators are random variables; an estimate derived from such samples has an uncertainty associated with it. To characterize these uncertainties, which are called standard errors, two quantities are derived from the total log-likelihood function.

The Hessian of the total log-likelihood function is

$\nabla^{2} L (z_{1}, \dots, z_{m}; θ)$

and the Fisher information matrix is

$I (θ) = - E [\nabla^{2} L (z_{1}, \dots, z_{m}; θ)],$

where the partial derivatives of the $\nabla^{2}$ operator are taken with respect to the combined parameter vector θ that contains the distinct components of b and C with a total of q = p + n (n + 1)/2 parameters.

Since maximum likelihood estimation is concerned with large-sample estimates, the central limit theorem applies to the estimates and the Fisher information matrix plays a key role in the sampling distribution of the parameter estimates. Specifically, maximum likelihood parameter estimates are asymptotically normally distributed such that

$(θ^{(t)} - θ) \sim N (0, I^{- 1} (θ^{(t)})) \text{ as } t \to \infty,$

where θ is the combined parameter vector and θ⁽ᵗ⁾ is the estimate for the combined parameter vector at iteration t = 0, 1, ... .

The inverse of the Fisher information matrix provides a lower bound, called the Cramér-Rao lower bound, for the covariance of the parameter estimates and hence for the standard errors of estimates of the model parameters.

Statistical Tests

Given an estimate $\hat{θ}$ for the combined parameter vector θ, the squared standard errors are the diagonal elements of the inverse of the Fisher information matrix evaluated at that estimate

$s^{2} ({\hat{θ}}_{i}) = {(I^{- 1} (\hat{θ}))}_{i i}$

for i = 1, ..., q.

Since the standard errors are estimates for the standard deviations of the parameter estimates, you can construct confidence intervals so that, for example, a 95% interval for each parameter estimate is approximately

${\hat{θ}}_{i} \pm 1.96 s ({\hat{θ}}_{i})$

for i = 1, ..., q.
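
As an illustration restricted to the regression parameters b, the following NumPy sketch uses the standard result that, for this Gaussian model, the block of the Fisher information associated with b is the sum over k of H_kᵀ C⁻¹ H_k. The function names `b_standard_errors` and `confidence_intervals` are assumptions for illustration; standard errors for the elements of C require the remaining block of the information matrix and are not covered here.

```python
import numpy as np

def b_standard_errors(H, C):
    """Standard errors of the estimates of b from the b-block of the information.

    H: (m, n, p) design matrices, C: (n, n) estimated covariance.
    """
    Cinv = np.linalg.inv(C)
    info_bb = np.einsum('kni,nl,klj->ij', H, Cinv, H)   # sum_k H_k^T C^-1 H_k
    cov_b = np.linalg.inv(info_bb)                      # approximate covariance of b-hat
    return np.sqrt(np.diag(cov_b))

def confidence_intervals(b_hat, se, z_crit=1.96):
    """Approximate intervals b_hat +/- z_crit * se, one row per parameter."""
    return np.column_stack([b_hat - z_crit * se, b_hat + z_crit * se])

# Hypothetical example: identity designs, m = 100, C = diag(4, 9)
H = np.tile(np.eye(2), (100, 1, 1))
se = b_standard_errors(H, np.diag([4.0, 9.0]))          # approximately [0.2, 0.3]
ci = confidence_intervals(np.array([1.0, 2.0]), se)
```

In the example, each standard error equals the standard deviation of the series divided by the square root of m, as expected for a sample mean.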

Error ellipses at a level-of-significance α ∈ [0, 1] for the parameter estimates satisfy the inequality

${(θ - \hat{θ})}^{T} I (\hat{θ}) (θ - \hat{θ}) \leq χ_{1 - α, q}^{2}$

and follow a $χ^{2}$ distribution with q degrees-of-freedom. Similar inequalities can be formed for any subcollection of the parameters.
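
The inequality can be checked numerically for a candidate parameter vector, as in this sketch. The function name `inside_error_ellipse` is an assumption for illustration, and the caller supplies the chi-square critical value. For q = 2 only, the critical value has the closed form −2 log(α), which the example uses to avoid depending on a statistics library.

```python
import numpy as np

def inside_error_ellipse(theta, theta_hat, info, crit):
    """True if (theta - theta_hat)^T I (theta - theta_hat) <= crit."""
    d = np.asarray(theta, dtype=float) - np.asarray(theta_hat, dtype=float)
    return bool(d @ info @ d <= crit)

alpha = 0.05
crit_q2 = -2.0 * np.log(alpha)       # chi-square critical value, valid for q = 2 only

# Hypothetical information matrix and estimate for two parameters
info = 100.0 * np.eye(2)
theta_hat = np.array([0.0, 0.0])
near = inside_error_ellipse([0.1, 0.1], theta_hat, info, crit_q2)   # quadratic form = 2
far = inside_error_ellipse([0.3, 0.0], theta_hat, info, crit_q2)    # quadratic form = 9
```

For other values of q, the critical value would come from a chi-square quantile function or table.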

In general, given parameter estimates, the computed Fisher information matrix, and the log-likelihood function, you can perform numerous statistical tests on the parameters, the model, and the regression.