collintest(X) displays
diagnostics for assessing the strength and sources of collinearity
among variables in a multiple linear regression
model. Singular
values of the scaled variable matrix, X,
are converted to condition
indices, which identify the number and strength of any near
dependencies in the design matrix. The variance of the ordinary least
squares (OLS) estimates of the regression coefficients is decomposed
in terms of the singular values to identify variables involved in
each near dependency, and the extent to which the dependencies degrade
the regression.

Only the last row in the display has a condition index larger than the default tolerance, 30. In this row, the last three variables (in the last three columns) have variance-decomposition proportions exceeding the default tolerance, 0.5.

The plot corresponds to the values in the last row of variance-decomposition proportions, which is the only one with a condition index larger than the default tolerance, 30. The last three variables in this row have variance-decomposition proportions exceeding the default tolerance, 0.5, indicated by red markers in the plot.

The output argument varDecomp is a matrix of the variance-decomposition proportions. sv is a vector of singular values in descending order, and condIdx is a vector of condition indices in ascending order.

Input regression variables, specified as a numObs-by-numVars matrix
or dataset array. For models with an intercept, X should
contain a column of ones.

collintest scales the columns of X to
unit length before processing. Data in X should
not be centered.

Data Types: double
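
The scaling step described above can be illustrated outside MATLAB. The following is a minimal NumPy sketch, not the collintest implementation; the helper name scale_columns_to_unit_length and the sample matrix are hypothetical. It divides each column by its Euclidean norm and deliberately applies no centering.

```python
import numpy as np

def scale_columns_to_unit_length(X):
    """Divide each column by its Euclidean norm; no centering is applied."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=0)  # one norm per column
    if np.any(norms == 0):
        raise ValueError("cannot scale an all-zero column")
    return X / norms

# Hypothetical design matrix: a column of ones (intercept) plus two predictors.
X = np.array([[1.0, 2.0, 10.0],
              [1.0, 4.0, 21.0],
              [1.0, 6.0, 29.0],
              [1.0, 8.0, 41.0]])

Xs = scale_columns_to_unit_length(X)
print(np.linalg.norm(Xs, axis=0))  # every column now has unit length
```

Because the columns are only rescaled, the intercept column stays a nonzero constant and remains visible to the diagnostics.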

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments.
Name is the argument
name and Value is the corresponding
value. Name must appear
inside single quotes (' ').
You can specify several name and value pair
arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'plot','on','tolIdx',35 displays
a plot of the results using a condition index tolerance of 35.

Variable names used in displays and plots of the results, specified
as the comma-separated pair consisting of 'varNames' and
a cell vector of strings. varNames must have length numVars,
and each cell corresponds to a variable name. If an intercept term
is present, then varNames must include the intercept
term (e.g., include the name 'Const'). The software
truncates all variable names to the first five characters.

If X is a matrix, then the default
value of varNames is the cell vector of strings {'var1','var2',...}.

If X is a dataset array,
then the default value of varNames is the property X.Properties.VarNames.

Indicator for whether to display results
in the Command Window, specified as the comma-separated pair consisting
of 'display' and one of 'on' or 'off'.
If you specify the value 'on', then all outputs
are displayed in tabular form.

Indicator for whether to plot results, specified
as the comma-separated pair consisting of 'plot' and
one of 'on' or 'off'.

If you specify the value 'on',
then the plot shows the critical rows of the output VarDecomp;
that is, those rows with condition
indices above the input tolerance tolIdx.

If a group of at least two variables in a critical
row has variance-decomposition proportions above
the input tolerance tolProp, then the group is
identified with red markers.

Condition index tolerance, specified as the comma-separated
pair consisting of 'tolIdx' and a scalar value
of at least one. collintest uses the tolerance
to decide which indices are large enough to infer a near dependency
in the data. The default tolerance is 30. The tolIdx value is used
only when plot has the value 'on'.

Variance-decomposition proportion tolerance, specified as the
comma-separated pair consisting of 'tolProp' and
a scalar value between zero and one. collintest uses
the tolerance to decide which variables are involved in any near dependency.
The default tolerance is 0.5. The tolProp value is used only when plot has
the value 'on'.
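
The way the two tolerances interact can be sketched in a few lines of plain Python. This is an illustration of the rule described above, not the collintest plotting code: the function name flag_near_dependencies and the rounded sample values are hypothetical, and the use of strict "greater than" comparisons is an assumption based on the word "above".

```python
def flag_near_dependencies(cond_idx, var_decomp, tol_idx=30.0, tol_prop=0.5):
    """Map each critical row to the group of variables involved in it.

    cond_idx   : condition indices, one per row of var_decomp
    var_decomp : rows correspond to condition indices, columns to variables
    """
    flagged = {}
    for row, (idx, props) in enumerate(zip(cond_idx, var_decomp)):
        if idx <= tol_idx:
            continue  # not a critical row (assumed strict comparison)
        group = [col for col, p in enumerate(props) if p > tol_prop]
        if len(group) >= 2:  # a dependency needs at least two variables
            flagged[row] = group
    return flagged

# Hypothetical, rounded diagnostics for three variables.
cond_idx = [1.0, 1.4, 200.0]
var_decomp = [
    [0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00],
    [0.00, 1.00, 1.00],
]

flagged = flag_near_dependencies(cond_idx, var_decomp)
print(flagged)  # {2: [1, 2]}: the last row involves the second and third variables
```

Raising tol_idx above the largest condition index leaves no critical rows, so nothing is flagged.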

Condition indices,
returned as a vector with elements in ascending order. All condition
indices have values between one and the condition number of scaled X.
Large indices identify near dependencies among the variables in X.
The size of an index measures how close the corresponding near dependency
is to exact collinearity.
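
As a rough illustration of how condition indices relate to singular values, the sketch below computes them with NumPy. It assumes that each index is the largest singular value of the scaled matrix divided by one of the singular values, which is consistent with the range stated above; the function name condition_indices and the sample data are hypothetical.

```python
import numpy as np

def condition_indices(X):
    """Return singular values (descending) and condition indices (ascending)."""
    X = np.asarray(X, dtype=float)
    Xs = X / np.linalg.norm(X, axis=0)        # scale columns to unit length
    s = np.linalg.svd(Xs, compute_uv=False)   # NumPy returns them in descending order
    return s, s[0] / s                        # assumed definition: s_1 / s_j

# Hypothetical data: an intercept plus two nearly identical predictors.
X = np.array([[1.0,  1.0,  1.01],
              [1.0, -1.0, -0.99],
              [1.0,  1.0,  0.99],
              [1.0, -1.0, -1.01]])

s, cond_idx = condition_indices(X)
print(cond_idx)  # first index is 1; the last is far above the tolerance of 30
```

The last entry equals the condition number of the scaled matrix, so a single large index signals one near dependency in this example.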

Variance-decomposition
proportions, returned as a numVars-by-numVars matrix.
Large proportions, combined with a large condition index, identify
groups of variables involved in near dependencies. The size of the
proportions is a measure of how badly the regression is degraded by
the dependency.

The condition number of
a scaled matrix X is an overall diagnostic for
detecting collinearity.

For scaled matrix X with p columns
and singular values $s_1 \ge s_2 \ge \dots \ge s_p$, the condition
number is

$$\frac{s_1}{s_p}.$$

The condition number achieves its lower bound of one when the
columns of scaled X are orthonormal. The condition
number rises as variates exhibit greater dependency.
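
The two extremes described above can be checked numerically. The following NumPy sketch is illustrative only: the helper scaled_condition_number and both sample matrices are hypothetical, and the condition number is taken as the ratio of the largest to the smallest singular value of the column-scaled matrix.

```python
import numpy as np

def scaled_condition_number(X):
    """Condition number of X after scaling its columns to unit length."""
    X = np.asarray(X, dtype=float)
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)  # descending order
    return s[0] / s[-1]

# Columns that are orthogonal before scaling become orthonormal after it.
X_orth = np.array([[1.0,  1.0],
                   [1.0, -1.0],
                   [1.0,  1.0],
                   [1.0, -1.0]])

# Second column is almost a copy of the first.
X_near = np.array([[1.0, 1.00],
                   [1.0, 1.01],
                   [1.0, 0.99],
                   [1.0, 1.00]])

kappa_orth = scaled_condition_number(X_orth)
kappa_near = scaled_condition_number(X_near)
print(kappa_orth, kappa_near)  # close to 1 for X_orth, in the hundreds for X_near
```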

A limitation of the condition number as a diagnostic is that
it fails to provide specifics on the strength and sources of any near
dependencies.

Variance-decomposition proportions identify
groups of variates involved in near dependencies, and the extent to
which the dependencies degrade the regression.

From the singular value decomposition $USV^T$ of scaled design matrix X (with p columns),
let:

V be the matrix of orthonormal
eigenvectors of $X^T X$

$s_1 \ge s_2 \ge \dots \ge s_p$ be the ordered
diagonal elements of the matrix S

The variance of the OLS estimate of the ith
multiple linear regression coefficient, $\beta_i$,
is proportional to the sum

$$\frac{V_{i1}^2}{s_1^2} + \frac{V_{i2}^2}{s_2^2} + \dots + \frac{V_{ip}^2}{s_p^2},$$

where $V_{ij}$ denotes the (i,j)th
element of V.

The (i,j)th variance-decomposition
proportion is the proportion of the jth term in
the sum relative to the entire sum, j = 1,...,p.

The terms $s_j^2$ are the eigenvalues
of scaled $X^T X$. Thus, large variance-decomposition
proportions correspond to small eigenvalues of $X^T X$, a common diagnostic for collinearity.
The singular-value decomposition provides a more direct, numerically
stable view of the eigensystem of scaled $X^T X$.
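
The decomposition described in this section can be sketched with NumPy as follows. This is a hedged illustration of the formulas, not the collintest source: the function name variance_decomposition and the sample data are hypothetical. The returned matrix is arranged with one row per singular value and one column per variable, which is an assumption chosen to match the layout of the display discussed earlier.

```python
import numpy as np

def variance_decomposition(X):
    """Singular values, condition indices, and variance-decomposition proportions."""
    X = np.asarray(X, dtype=float)
    Xs = X / np.linalg.norm(X, axis=0)              # unit-length columns, no centering
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    V = Vt.T                                        # V[i, j] is element (i, j) of V
    phi = V**2 / s**2                               # phi[i, j] = V_ij^2 / s_j^2
    props = phi / phi.sum(axis=1, keepdims=True)    # share of term j in coefficient i's sum
    return s, s[0] / s, props.T                     # rows: singular values, columns: variables

# Hypothetical data: an intercept plus two nearly identical predictors.
X = np.array([[1.0,  1.0,  1.01],
              [1.0, -1.0, -0.99],
              [1.0,  1.0,  0.99],
              [1.0, -1.0, -1.01]])

s, cond_idx, var_decomp = variance_decomposition(X)
print(np.round(cond_idx, 2))
print(np.round(var_decomp, 4))
```

In this example the last row has a large condition index, and almost all of the variance of the second and third coefficients is attributed to it, while the intercept is essentially uninvolved.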

For purposes of collinearity diagnostics, Belsley et al. [1] show that column scaling of the design matrix, X,
is always desirable. However, centering the data in X is
undesirable. For models with an intercept, centering can hide the
role of the constant term in any near dependency, and produce misleading
diagnostics.
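
A small numerical experiment shows how centering can mask a dependency that involves the constant term. The sketch below is illustrative only; the helper scaled_condition_number and the data are hypothetical. The predictor varies very little around a large mean, so it is nearly proportional to the column of ones.

```python
import numpy as np

def scaled_condition_number(X):
    """Condition number of X after scaling its columns to unit length."""
    X = np.asarray(X, dtype=float)
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)  # descending order
    return s[0] / s[-1]

const = np.ones(4)
x = np.array([100.1, 99.9, 100.2, 99.8])     # nearly constant predictor

X_raw = np.column_stack([const, x])
X_centered = np.column_stack([const, x - x.mean()])

kappa_raw = scaled_condition_number(X_raw)
kappa_centered = scaled_condition_number(X_centered)
print(kappa_raw, kappa_centered)  # large for the raw data, close to 1 after centering
```

After centering, the predictor is orthogonal to the intercept column, so the diagnostic reports no problem even though the original data carry a strong near dependency with the constant.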

Tolerances for identifying large condition indices
and variance-decomposition proportions are comparable to critical
values in standard hypothesis tests. Experience determines the most
useful tolerance, but experiments suggest the collintest defaults
are good starting points [1].

References

[1] Belsley, D. A., E. Kuh, and R. E. Welsch. Regression
Diagnostics. New York, NY: John Wiley & Sons, Inc.,
1980.

[2] Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl,
and T. C. Lee. The Theory and Practice of Econometrics.
New York, NY: John Wiley & Sons, Inc., 1985.