Each regression function has a specific operation. This section shows how to use these functions to perform specific types of regressions. To illustrate use of the functions for various regressions, "typical" usage is shown with optional arguments kept to a minimum. For a typical regression, you estimate model parameters and residual covariance matrices with the mle functions and estimate the standard errors of model parameters with the std functions. The regressions "without missing data" essentially ignore samples with any missing values, and the regressions "with missing data" ignore samples with every value missing.
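The two conventions for ignoring samples can be made concrete with a short base-MATLAB sketch. The matrix below is made-up illustration data, and the variable names CountComplete and CountPartial are assumptions rather than toolbox outputs.

```matlab
% Hypothetical 5-by-3 data matrix; NaN marks a missing value.
Data = [ 0.01  0.02   NaN
         0.03   NaN   NaN
          NaN   NaN   NaN
         0.02  0.01  0.04
        -0.01  0.00  0.02];

% "Without missing data": only samples (rows) with no NaN at all are used.
CountComplete = sum(all(~isnan(Data), 2));

% "With missing data": only samples with every value missing are dropped.
CountPartial = sum(any(~isnan(Data), 2));

fprintf('Complete rows: %d, rows with some data: %d\n', ...
    CountComplete, CountPartial);
```

The gap between the two counts is the information that the missing-data routines can use and the other routines discard.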
Multivariate normal regression, or MVNR, is the "standard" implementation of the regression functions in Financial Toolbox™ software.
Multivariate Normal Regression Without Missing Data
Estimate Parameters
[Parameters, Covariance] = mvnrmle(Data, Design);
Estimate Standard Errors
StdParameters = mvnrstd(Data, Design, Covariance);
Multivariate Normal Regression With Missing Data
Estimate Parameters
[Parameters, Covariance] = ecmmvnrmle(Data, Design);
Estimate Standard Errors
StdParameters = ecmmvnrstd(Data, Design, Covariance);
Least-squares regression, or LSR, sometimes called ordinary least-squares or multiple linear regression, is the simplest linear regression model. It also enjoys the property that, independent of the underlying distribution, it is a best linear unbiased estimator (BLUE).
Given m = NumSamples observations, the typical least-squares regression model seeks to minimize the objective function
$$\sum _{k=1}^{m}{\left({Z}_{k}-{H}_{k}b\right)}^{T}\left({Z}_{k}-{H}_{k}b\right),$$
which, within the maximum likelihood framework of the multivariate normal regression routine mvnrmle, is equivalent to a single-iteration estimation of just the parameters to obtain Parameters with the initial covariance matrix Covariance held fixed as the identity matrix. In the case of missing data, however, the internal algorithm to handle missing data requires a separate routine ecmlsrmle to do least-squares instead of multivariate normal regression.
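The least-squares objective can also be minimized directly by solving its normal equations. The following base-MATLAB sketch does this on synthetic data. The sizes, the "true" parameter vector bTrue, and the noise level are all assumptions chosen for illustration, and the code is not the toolbox implementation.

```matlab
% Synthetic example; all names and values here are illustrative assumptions.
rng(0);                               % reproducible random numbers
NumSamples = 50; NumSeries = 2; NumParams = 3;
bTrue = [1; -2; 0.5];                 % hypothetical "true" parameters
Design = cell(NumSamples, 1);         % one NumSeries-by-NumParams matrix per sample
Data = zeros(NumSamples, NumSeries);
for k = 1:NumSamples
    Design{k} = randn(NumSeries, NumParams);
    Data(k, :) = (Design{k} * bTrue)' + 0.01 * randn(1, NumSeries);
end

% Accumulate the normal equations: (sum H_k' * H_k) * b = sum H_k' * Z_k
A = zeros(NumParams);
c = zeros(NumParams, 1);
for k = 1:NumSamples
    A = A + Design{k}' * Design{k};
    c = c + Design{k}' * Data(k, :)';
end
bLS = A \ c;                          % minimizer of the least-squares objective
```

With the Financial Toolbox installed, bLS should be comparable to the Parameters output of a single-iteration mvnrmle call on the same Data and Design.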
Least-Squares Regression Without Missing Data
Estimate Parameters
[Parameters, Covariance] = mvnrmle(Data, Design, 1);
Estimate Standard Errors
StdParameters = mvnrstd(Data, Design, Covariance);
Least-Squares Regression With Missing Data
Estimate Parameters
[Parameters, Covariance] = ecmlsrmle(Data, Design);
Estimate Standard Errors
StdParameters = ecmmvnrstd(Data, Design, Covariance);
Given m = NumSamples observations, the typical covariance-weighted least squares, or CWLS, regression model seeks to minimize the objective function
$$\sum _{k=1}^{m}{\left({Z}_{k}-{H}_{k}b\right)}^{T}{C}_{0}^{-1}\left({Z}_{k}-{H}_{k}b\right)$$
with fixed covariance C_{0}.
In most cases, C_{0} is a diagonal matrix. The inverse matrix $$W={C}_{0}^{-1}$$ has diagonal elements that can be considered relative "weights" for each series. Thus, CWLS is a form of weighted least squares with the weights applied across series.
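For reference, a weighted objective of this kind has a closed-form minimizer. The sketch below assumes that the weight matrix W is symmetric positive-definite and that the summed matrix on the left is invertible. Setting the gradient of $$\sum _{k=1}^{m}{\left({Z}_{k}-{H}_{k}b\right)}^{T}W\left({Z}_{k}-{H}_{k}b\right)$$ with respect to b to zero gives

$$\sum _{k=1}^{m}{H}_{k}^{T}W\left({Z}_{k}-{H}_{k}b\right)=0\quad \Rightarrow \quad \hat{b}={\left(\sum _{k=1}^{m}{H}_{k}^{T}W{H}_{k}\right)}^{-1}\sum _{k=1}^{m}{H}_{k}^{T}W{Z}_{k}.$$

When W is the identity matrix, this reduces to the ordinary least-squares estimate.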
Covariance-Weighted Least Squares Without Missing Data
Estimate Parameters
[Parameters, Covariance] = mvnrmle(Data, Design, 1, [], [], Covar0);
Estimate Standard Errors
StdParameters = mvnrstd(Data, Design, Covariance);
Covariance-Weighted Least Squares With Missing Data
Estimate Parameters
[Parameters, Covariance] = ecmlsrmle(Data, Design, [], [], [], [], Covar0);
Estimate Standard Errors
StdParameters = ecmmvnrstd(Data, Design, Covariance);
An ad hoc form of least squares that has surprisingly good properties for misspecified or nonnormal models is known as feasible generalized least squares, or FGLS. The basic procedure is to do least-squares regression and then to do covariance-weighted least-squares regression with the resultant residual covariance from the first regression.
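The two-step procedure can be sketched in base MATLAB as two passes of a weighted least-squares solve, where the second pass is weighted by the inverse of the residual covariance from the first. The synthetic data, the sizes, and the covariance Sigma below are assumptions for illustration. This is not the toolbox code.

```matlab
% Synthetic data; every name and value below is an illustrative assumption.
rng(1);
NumSamples = 200; NumSeries = 2; NumParams = 2;
bTrue = [0.5; -1];
Sigma = [1.0 0.3; 0.3 0.25];          % hypothetical residual covariance
L = chol(Sigma, 'lower');
Design = cell(NumSamples, 1);
Data = zeros(NumSamples, NumSeries);
for k = 1:NumSamples
    Design{k} = randn(NumSeries, NumParams);
    Data(k, :) = (Design{k} * bTrue + L * randn(NumSeries, 1))';
end

W = eye(NumSeries);                   % pass 1: identity weights (plain least squares)
for pass = 1:2
    A = zeros(NumParams);
    c = zeros(NumParams, 1);
    for k = 1:NumSamples
        A = A + Design{k}' * W * Design{k};
        c = c + Design{k}' * W * Data(k, :)';
    end
    b = A \ c;                        % weighted least-squares estimate

    Resid = zeros(NumSamples, NumSeries);
    for k = 1:NumSamples
        Resid(k, :) = Data(k, :) - (Design{k} * b)';
    end
    Covar = (Resid' * Resid) / NumSamples;   % residual covariance of this pass
    W = inv(Covar);                   % pass 2: weight by the inverse covariance
end
```

After the loop, b holds the FGLS-style estimate and Covar holds the residual covariance from the second pass.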
Feasible Generalized Least Squares Without Missing Data
Estimate Parameters
[Parameters, Covariance] = mvnrmle(Data, Design, 2, 0, 0);
or (to illustrate the FGLS process explicitly)
[Parameters, Covar0] = mvnrmle(Data, Design, 1);
[Parameters, Covariance] = mvnrmle(Data, Design, 1, [], [], Covar0);
Estimate Standard Errors
StdParameters = mvnrstd(Data, Design, Covariance);
Feasible Generalized Least Squares With Missing Data
Estimate Parameters
[Parameters, Covar0] = ecmlsrmle(Data, Design);
[Parameters, Covariance] = ecmlsrmle(Data, Design, [], [], [], [], Covar0);
Estimate Standard Errors
StdParameters = ecmmvnrstd(Data, Design, Covariance);
Given a multivariate normal regression model in standard form with a Data matrix and a Design array, it is possible to convert the problem into a seemingly unrelated regression (SUR) problem by a simple transformation of the Design array. The main idea of SUR is that instead of having a common parameter vector over all data series, you have a separate parameter vector associated with each separate series or with distinct groups of series that, nevertheless, share a common residual covariance. It is this ability to aggregate and disaggregate series and to perform comparative tests on each design that is the power of SUR.
To make the transformation, use the function convert2sur, which converts a standard-form design array into an equivalent design array to do SUR with a specified mapping of the series into NUMGROUPS groups. The regression functions are used in the usual manner, but with the SUR design array instead of the original design array. Instead of having NUMPARAMS elements, the SUR output parameter vector has NUMGROUPS stacked blocks of parameter estimates: the first NUMPARAMS elements of Parameters contain the estimates for the first group of series, the next NUMPARAMS elements contain the estimates for the second group, and so on. If the model has only one series, that is, NUMSERIES = 1, then the SUR design array is the same as the original design array, since SUR requires two or more series to generate distinct parameter estimates.
Given NUMPARAMS parameters and NUMGROUPS groups with a parameter vector Parameters with NUMGROUPS * NUMPARAMS elements from any of the regression routines, the following MATLAB^{®} code fragment shows how to print a table of SUR parameter estimates with rows that correspond to each parameter and columns that correspond to each group or series:
fprintf(1,'Seemingly Unrelated Regression Parameter Estimates\n');
fprintf(1,' %7s ',' ');
fprintf(1,' Group(%3d) ',1:NumGroups);
fprintf(1,'\n');
for i = 1:NumParams
    fprintf(1,' %7d ',i);          % row label: parameter index
    ii = i;
    for j = 1:NumGroups
        fprintf(1,'%12g ',Parameters(ii));
        ii = ii + NumParams;       % same parameter in the next group
    end
    fprintf(1,'\n');
end
fprintf(1,'\n');
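Because the estimates are stacked group by group, the same table can be obtained as a matrix with reshape. The sizes and values below are made up for illustration, and ParamTable is a hypothetical variable name.

```matlab
% Hypothetical sizes and values, purely for illustration.
NumParams = 3; NumGroups = 2;
Parameters = [1.1; 1.2; 1.3; 2.1; 2.2; 2.3];   % stacked group by group

% Row i is parameter i; column j holds the estimates for group j.
ParamTable = reshape(Parameters, NumParams, NumGroups);
disp(ParamTable);
```

This works because reshape fills columns first, so each consecutive block of NumParams elements becomes one column.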
Seemingly Unrelated Regression Without Missing Data
Form an SUR Design
DesignSUR = convert2sur(Design, Group);
Estimate Parameters
[Parameters, Covariance] = mvnrmle(Data, DesignSUR);
Estimate Standard Errors
StdParameters = mvnrstd(Data, DesignSUR, Covariance);
Seemingly Unrelated Regression With Missing Data
Form an SUR Design
DesignSUR = convert2sur(Design, Group);
Estimate Parameters
[Parameters, Covariance] = ecmmvnrmle(Data, DesignSUR);
Estimate Standard Errors
StdParameters = ecmmvnrstd(Data, DesignSUR, Covariance);
Without missing data, you can estimate the mean of your Data with the function mean and the covariance with the function cov. Nevertheless, the function ecmnmle does this for you if it detects an absence of missing values. Otherwise, it uses the ECM algorithm to handle missing values.
Estimate Parameters
[Mean, Covariance] = ecmnmle(Data);
Estimate Standard Errors
StdMean = ecmnstd(Data, Mean, Covariance);
This section provides a few pointers to handle various technical and operational difficulties that might occur.
If samples are ignored, the number of samples used in the estimation is less than NumSamples. Clearly the actual number of samples used must be sufficient to obtain estimates. In addition, although the model parameters Parameters (or mean estimates Mean) are unbiased maximum likelihood estimates, the residual covariance estimate Covariance is biased. To convert to an unbiased covariance estimate, multiply Covariance by
$$\text{Count}/\left(\text{Count}-1\right),$$
where Count is the actual number of samples used in the estimation with Count ≤ NumSamples. Note that none of the regression functions perform this adjustment.
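The rescaling can be checked in base MATLAB with the built-in cov function, whose second argument selects the normalization. The sketch assumes complete data and assumes that a maximum likelihood covariance estimate divides by Count. The random Data matrix is a stand-in, not toolbox output.

```matlab
rng(2);
Data = randn(30, 3);                  % hypothetical complete data, no NaN values
Count = size(Data, 1);

CovBiased = cov(Data, 1);             % normalizes by Count (maximum likelihood form)
CovUnbiased = CovBiased * Count / (Count - 1);

% The rescaled matrix agrees with the default cov, which normalizes by Count - 1.
CovDefault = cov(Data);
maxDiff = max(abs(CovUnbiased(:) - CovDefault(:)));
```

For large Count the factor is close to 1, so the adjustment matters mainly for short samples.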
The regression functions, particularly the estimation functions, have several requirements. First, they must have consistent values for NumSamples, NumSeries, and NumParams. As a general rule, the multivariate normal regression functions require
$$\text{Count}\times \text{NumSeries}\ge \mathrm{max}\left\{\text{NumParams},\text{NumSeries}\times \left(\text{NumSeries}+1\right)/2\right\}$$
and the least-squares regression functions require
$$\text{Count}\times \text{NumSeries}\ge \text{NumParams},$$
where Count is the actual number of samples used in the estimation with
$$\text{Count}\le \text{NumSamples}\text{.}$$
Second, they must have enough nonmissing values to converge. Third, they must have a nondegenerate covariance matrix.
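A quick pre-check of the first requirement is to confirm that the number of available values is at least the number of quantities to estimate. The sizes below are hypothetical, and the variable names are assumptions. With missing data, Count * NumSeries overstates the number of nonmissing values, so treat this as a necessary check only.

```matlab
% Hypothetical problem sizes; replace with the values for your model.
Count = 20; NumSeries = 5; NumParams = 8;

NumValues = Count * NumSeries;                          % values available
NeedMVNR = max(NumParams, NumSeries * (NumSeries + 1) / 2);
NeedLSR = NumParams;

okMVNR = NumValues >= NeedMVNR;   % enough for parameters and covariance
okLSR = NumValues >= NeedLSR;     % enough for parameters alone
```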
Although some necessary and sufficient conditions can be found in the references, general conditions for existence and uniqueness of solutions in the missing-data case do not exist. Nonconvergence is usually due to an ill-conditioned covariance matrix estimate, which is discussed in greater detail in Nonconvergence.
Since worst-case convergence of the ECM algorithm is linear, it is possible to execute hundreds and even thousands of iterations before termination of the algorithm. If you are estimating with the ECM algorithm on a regular basis with regular updates, you can use prior estimates as initial guesses for the next period's estimation. This approach often speeds things up since the default initialization in the regression functions sets the initial parameters b to zero and the initial covariance C to be the identity matrix.
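A warm start might look like the following sketch, which requires Financial Toolbox. It assumes that ecmmvnrmle accepts initial parameters and an initial covariance in the sixth and seventh argument positions, matching the ecmlsrmle calls shown earlier in this section. The synthetic data and the names PrevParameters and PrevCovariance are hypothetical.

```matlab
% Synthetic data with missing values; all names and values are assumptions.
rng(3);
NumSamples = 60; NumSeries = 2; NumParams = 2;
bTrue = [1; -1];
Design = cell(NumSamples, 1);
Data = zeros(NumSamples, NumSeries);
for k = 1:NumSamples
    Design{k} = randn(NumSeries, NumParams);
    Data(k, :) = (Design{k} * bTrue + 0.1 * randn(NumSeries, 1))';
end
Data(rand(size(Data)) < 0.1) = NaN;   % remove roughly 10% of the values

% Previous period: estimate from the first 50 samples with default initialization.
[PrevParameters, PrevCovariance] = ecmmvnrmle(Data(1:50, :), Design(1:50));

% Next period: reuse the prior estimates as initial guesses (assumed positions).
[Parameters, Covariance] = ecmmvnrmle(Data, Design, [], [], [], ...
    PrevParameters, PrevCovariance);
```

Whether this reduces the iteration count depends on how much the new data changes the estimates.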
Other ad hoc approaches are possible although most approaches are problem-dependent. In particular, for mean and covariance estimation, the estimation function ecmnmle uses a function ecmninit to obtain an initial estimate.
Simultaneous estimates for parameters b and covariances C require C to be positive-definite. Consequently, the general multivariate normal regression routines require nondegenerate residual errors. If you are faced with a model that has exact results, the least-squares routine ecmlsrmle still works, although it provides a least-squares estimate with a singular residual covariance matrix. The other regression functions will fail.
Although the regression functions are robust and work for most "typical" cases, they can fail to converge. The main failure mode is an ill-conditioned covariance matrix, where failures are either soft or hard. A soft failure wanders endlessly toward a nearly singular covariance matrix and can be spotted if the algorithm fails to converge after about 100 iterations. If MaxIterations is increased to 500 and display mode is initiated (with no output arguments), a typical soft failure looks like this.
This case, which is based on 20 observations of 5 assets with 30% of data missing, shows that the log-likelihood goes linearly to infinity as the likelihood function goes to 0. In this case, the function converges but the covariance matrix is effectively singular with a smallest eigenvalue on the order of machine precision (eps).
For the function ecmnmle, a hard error looks like this:
> In ecmninit at 60
  In ecmnmle at 140
??? Error using ==> ecmnmle
Full covariance not positive-definite in iteration 218.
From a practical standpoint, if in doubt, test your residual covariance matrix from the regression routines to ensure that it is positive-definite. This is important because a soft error has a matrix that appears to be positive-definite but actually has a near-zero-valued eigenvalue to within machine precision. To do this with a covariance estimate Covariance, use cond(Covariance), where any value greater than 1/eps should be considered suspect.
If either type of failure occurs, however, note that the regression routine is indicating that something is probably wrong with the data. (Even with no missing data, two time series that are proportional to one another produce a singular covariance matrix.)
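The conditioning test described above can be wrapped in a one-line helper. The sketch uses base MATLAB only. The name isSuspect and the proportional-series data are assumptions for illustration.

```matlab
% Flag a covariance estimate whose condition number exceeds 1/eps.
isSuspect = @(C) cond(C) > 1/eps;

% Hypothetical example: two exactly proportional series.
rng(4);
x = randn(100, 1);
Covariance = cov([x, 2*x]);           % rank-deficient by construction

fprintf('cond = %g, suspect = %d\n', cond(Covariance), isSuspect(Covariance));
```

A well-conditioned matrix such as eye(3) has a condition number of 1 and is not flagged.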
This example illustrates how to use the missing data algorithms for portfolio optimization and for valuation. This example works with 5 years of daily total return data for 12 computer technology stocks, with 6 hardware and 6 software companies. The example estimates the mean and covariance matrix for these stocks, forms efficient frontiers with both a naïve approach and the ECM approach, and compares results.
You can run the example directly with ecmtechdemo.m.
Load the following data file:
load ecmtechdemo
This file contains these three quantities:
Assets is a cell array of the tickers for the twelve stocks in the example.
Data is a 1254-by-12 matrix of 1254 daily total returns for each of the 12 stocks.
Dates is a 1254-by-1 column vector of the dates associated with the data.
The time period for the data extends from April 19, 2000 to April 18, 2005.
The sixth stock in Assets is Google (GOOG), which started trading on August 19, 2004. Consequently, all returns before August 20, 2004 are missing and represented as NaNs. Also, Amazon (AMZN) had a few days with missing values scattered throughout the past 5 years.
A naïve approach to the estimation of the mean and covariance for these 12 assets is to eliminate all days that have missing values for any of the 12 assets. Use the function ecmninit with the nanskip option to do this.
[NaNMean, NaNCovar] = ecmninit(Data,'nanskip');
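The idea behind the naïve approach can be sketched in base MATLAB by dropping every row that contains a NaN before computing the mean and covariance. The small matrix below is made up, the names keep, NaiveMean, and NaiveCovar are hypothetical, and the covariance normalization used here (the cov default) is not claimed to match ecmninit.

```matlab
% Hypothetical data with scattered NaN values.
Data = [ 0.01  0.03
         0.02   NaN
        -0.01  0.00
          NaN  0.02
         0.04  0.01];

keep = all(~isnan(Data), 2);          % rows with no missing values
NaiveMean = mean(Data(keep, :))';     % column vector of per-series means
NaiveCovar = cov(Data(keep, :));      % default cov divides by (rows - 1)
```

Only three of the five rows survive here, which shows how quickly this approach discards data when one series has a long run of missing values.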
Contrast the result of this approach with using all available data and the function ecmnmle to compute the mean and covariance. First, call ecmnmle with no output arguments to establish that enough data is available to obtain meaningful estimates.
ecmnmle(Data);
The following figure shows that, even with almost 87% of the Google data being NaN values, the algorithm converges after only four iterations.
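The share of missing values per series can be computed directly from the data matrix. The sketch below uses a small hypothetical stand-in for Data. Applied to the example's matrix, the same expression would give one fraction per stock.

```matlab
% Hypothetical stand-in for the example's Data matrix.
Data = [ NaN   0.01
         NaN   0.02
         NaN  -0.01
        0.03   0.00];

FractionMissing = mean(isnan(Data));  % one fraction per column (series)
```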
Estimate the mean and covariance as computed by ecmnmle.
[ECMMean, ECMCovar] = ecmnmle(Data)
ECMMean =
    0.0008
    0.0008
   -0.0005
    0.0002
    0.0011
    0.0038
   -0.0003
   -0.0000
   -0.0003
   -0.0000
   -0.0003
    0.0004
ECMCovar =
    0.0012    0.0005    0.0006    0.0005    0.0005    0.0003
    0.0005    0.0024    0.0007    0.0006    0.0010    0.0004
    0.0006    0.0007    0.0013    0.0007    0.0007    0.0003
    0.0005    0.0006    0.0007    0.0009    0.0006    0.0002
    0.0005    0.0010    0.0007    0.0006    0.0016    0.0006
    0.0003    0.0004    0.0003    0.0002    0.0006    0.0022
    0.0005    0.0005    0.0006    0.0005    0.0005    0.0001
    0.0003    0.0003    0.0004    0.0003    0.0003    0.0002
    0.0006    0.0006    0.0008    0.0007    0.0006    0.0002
    0.0003    0.0004    0.0005    0.0004    0.0004    0.0001
    0.0005    0.0006    0.0008    0.0005    0.0007    0.0003
    0.0006    0.0012    0.0008    0.0007    0.0011    0.0016
ECMCovar (continued)
    0.0005    0.0003    0.0006    0.0003    0.0005    0.0006
    0.0005    0.0003    0.0006    0.0004    0.0006    0.0012
    0.0006    0.0004    0.0008    0.0005    0.0008    0.0008
    0.0005    0.0003    0.0007    0.0004    0.0005    0.0007
    0.0005    0.0003    0.0006    0.0004    0.0007    0.0011
    0.0001    0.0002    0.0002    0.0001    0.0003    0.0016
    0.0009    0.0003    0.0005    0.0004    0.0005    0.0006
    0.0003    0.0005    0.0004    0.0003    0.0004    0.0004
    0.0005    0.0004    0.0011    0.0005    0.0007    0.0007
    0.0004    0.0003    0.0005    0.0006    0.0004    0.0005
    0.0005    0.0004    0.0007    0.0004    0.0013    0.0007
    0.0006    0.0004    0.0007    0.0005    0.0007    0.0020
Given estimates for the mean and covariance of asset returns derived from the naïve and ECM approaches, estimate portfolios, and associated expected returns and risks on the efficient frontier for both approaches.
[ECMRisk, ECMReturn, ECMWts] = portopt(ECMMean',ECMCovar,10);
[NaNRisk, NaNReturn, NaNWts] = portopt(NaNMean',NaNCovar,10);
Plot the results on the same graph to illustrate the differences.
figure(gcf)
plot(ECMRisk,ECMReturn,'-bo','MarkerFaceColor','b','MarkerSize', 3);
hold on
plot(NaNRisk,NaNReturn,'-ro','MarkerFaceColor','r','MarkerSize', 3);
title('\bfMean-Variance Efficient Frontiers under Various Assumptions');
legend('ECM','NaN','Location','SouthEast');
xlabel('\bfStd. Dev. of Returns');
ylabel('\bfMean of Returns');
hold off
Clearly, the naïve approach is optimistic about the risk-return trade-offs for this universe of 12 technology stocks. The proof, however, lies in the portfolio weights. To view the weights, enter
Assets
ECMWts
NaNWts
which generates
>> Assets
ans =
    'AAPL'    'AMZN'    'CSCO'    'DELL'    'EBAY'    'GOOG'
>> ECMWts
ans =
    0.0358    0.0011   -0.0000    0.0000    0.0000    0.0989
    0.0654    0.0110    0.0000    0.0000    0.0000    0.1877
    0.0923    0.0194    0.0000    0.0000    0.0000    0.2784
    0.1165    0.0264    0.0000   -0.0000    0.0000    0.3712
    0.1407    0.0334   -0.0000         0    0.0000    0.4639
    0.1648    0.0403    0.0000         0   -0.0000    0.5566
    0.1755    0.0457    0.0000   -0.0000   -0.0000    0.6532
    0.1845    0.0509    0.0000    0.0000   -0.0000    0.7502
    0.1093    0.0174   -0.0000    0.0000         0    0.8733
         0         0   -0.0000    0.0000         0    1.0000
>> NaNWts
ans =
   -0.0000    0.0000   -0.0000    0.1185    0.0000    0.0522
    0.0576   -0.0000   -0.0000    0.1219    0.0000    0.0854
    0.1248   -0.0000   -0.0000    0.0952   -0.0000    0.1195
    0.1969   -0.0000   -0.0000    0.0529   -0.0000    0.1551
    0.2690   -0.0000   -0.0000    0.0105    0.0000    0.1906
    0.3414    0.0000   -0.0000   -0.0000   -0.0000    0.2265
    0.4235    0.0000   -0.0000   -0.0000   -0.0000    0.2639
    0.5245    0.0000   -0.0000   -0.0000   -0.0000    0.3034
    0.6269   -0.0000   -0.0000   -0.0000   -0.0000    0.3425
    1.0000   -0.0000   -0.0000    0.0000   -0.0000         0
Assets (continued)
    'HPQ'    'IBM'    'INTC'    'MSFT'    'ORCL'    'YHOO'
ECMWts (continued)
    0.0535    0.4676    0.0000    0.3431   -0.0000    0.0000
    0.0179    0.3899   -0.0000    0.3282    0.0000   -0.0000
         0    0.3025   -0.0000    0.3074    0.0000   -0.0000
    0.0000    0.2054   -0.0000    0.2806    0.0000    0.0000
    0.0000    0.1083   -0.0000    0.2538   -0.0000    0.0000
    0.0000    0.0111   -0.0000    0.2271   -0.0000    0.0000
    0.0000    0.0000   -0.0000    0.1255   -0.0000    0.0000
    0.0000         0   -0.0000    0.0143   -0.0000   -0.0000
    0.0000   -0.0000   -0.0000   -0.0000   -0.0000    0.0000
    0.0000   -0.0000   -0.0000   -0.0000   -0.0000    0.0000
NaNWts (continued)
    0.0824    0.1779    0.0000    0.5691   -0.0000    0.0000
    0.1274    0.0460    0.0000    0.5617   -0.0000   -0.0000
    0.1674   -0.0000    0.0000    0.4802    0.0129   -0.0000
    0.2056   -0.0000    0.0000    0.3621    0.0274   -0.0000
    0.2438   -0.0000    0.0000    0.2441    0.0419   -0.0000
    0.2782   -0.0000    0.0000    0.0988    0.0551   -0.0000
    0.2788   -0.0000    0.0000   -0.0000    0.0337   -0.0000
    0.1721   -0.0000    0.0000   -0.0000   -0.0000   -0.0000
    0.0306   -0.0000    0.0000    0.0000         0   -0.0000
         0    0.0000    0.0000   -0.0000   -0.0000   -0.0000
The naïve portfolios in NaNWts tend to favor Apple Computer (AAPL), which happened to do well over the period from the Google IPO to the end of the estimation period, while the ECM portfolios in ECMWts tend to underweight Apple Computer and to recommend increased weights in Google relative to the naïve weights.
To evaluate the impact of estimation error and, in particular, the effect of missing data, use ecmnstd to calculate standard errors. Although it is possible to estimate the standard errors for both the mean and covariance, the standard errors for the mean estimates alone are usually the main quantities of interest.
StdMeanF = ecmnstd(Data,ECMMean,ECMCovar,'fisher');
Calculate standard errors that use the data-generated Hessian matrix (which accounts for the possible loss of information due to missing data) with the option HESSIAN.
StdMeanH = ecmnstd(Data,ECMMean,ECMCovar,'hessian');
The difference in the standard errors shows the increase in uncertainty of estimation of asset expected returns due to missing data. This can be viewed by entering
Assets
StdMeanH'
StdMeanF'
StdMeanH' - StdMeanF'
The two assets with missing data, AMZN and GOOG, are the only assets to have differences due to missing information.