Belsley collinearity diagnostics

`collintest(X)` displays Belsley collinearity diagnostics for assessing the strength and sources of collinearity among variables in the matrix or tabular array `X`.

`collintest(X,Name,Value)` uses additional options specified by one or more `Name,Value` pairs.

`sValue = collintest(___)` returns the singular values in decreasing order, using any of the previous input arguments.

`[sValue,condIdx,VarDecomp] = collintest(___)` additionally returns the condition indices and variance-decomposition proportions.

Display collinearity diagnostics for multiple time series.

Load data of Canadian inflation and interest rates.

```
load Data_Canada
```

Display the Belsley collinearity diagnostics, using all default options.

```
collintest(DataTable)
```

```
Variance Decomposition

 sValue  condIdx   INF_C   INF_G   INT_S   INT_M   INT_L
---------------------------------------------------------
 2.1748   1        0.0012  0.0018  0.0003  0.0000  0.0001
 0.4789   4.5413   0.0261  0.0806  0.0035  0.0006  0.0012
 0.1602  13.5795   0.3386  0.3802  0.0811  0.0011  0.0137
 0.1211  17.9617   0.6138  0.5276  0.1918  0.0004  0.0193
 0.0248  87.8245   0.0202  0.0099  0.7233  0.9979  0.9658
```

Only the last row in the display has a condition index larger than the default tolerance, 30. In this row, the last three variables (in the last three columns) have variance-decomposition proportions exceeding the default tolerance, 0.5. This suggests that the variables `INT_S`, `INT_M`, and `INT_L` exhibit multicollinearity.

Plot collinearity diagnostics for multiple time series.

Load data of Canadian inflation and interest rates.

```
load Data_Canada
```

Plot the Belsley collinearity diagnostics using the `plot` option.

```
collintest(DataTable,'plot','on')
```

```
Variance Decomposition

 sValue  condIdx   INF_C   INF_G   INT_S   INT_M   INT_L
---------------------------------------------------------
 2.1748   1        0.0012  0.0018  0.0003  0.0000  0.0001
 0.4789   4.5413   0.0261  0.0806  0.0035  0.0006  0.0012
 0.1602  13.5795   0.3386  0.3802  0.0811  0.0011  0.0137
 0.1211  17.9617   0.6138  0.5276  0.1918  0.0004  0.0193
 0.0248  87.8245   0.0202  0.0099  0.7233  0.9979  0.9658
```

The plot corresponds to the values in the last row of variance-decomposition proportions, which is the only one with a condition index larger than the default tolerance, 30. The last three variables in this row have variance-decomposition proportions exceeding the default tolerance, 0.5, indicated by red markers in the plot.

Compute collinearity diagnostics for multiple time series and return the singular values, condition indices, and variance-decomposition proportions.

Load data of Canadian inflation and interest rates.

```
load Data_Canada
```

Compute the Belsley collinearity diagnostics. Turn off the results display using the `display` option.

```
[sv,condIdx,varDecomp] = collintest(DataTable,'display','off');
```

There is no display of the results.

Display the contents of `varDecomp`.

```
varDecomp
```

```
varDecomp =

    0.0012    0.0018    0.0003    0.0000    0.0001
    0.0261    0.0806    0.0035    0.0006    0.0012
    0.3386    0.3802    0.0811    0.0011    0.0137
    0.6138    0.5276    0.1918    0.0004    0.0193
    0.0202    0.0099    0.7233    0.9979    0.9658
```

The output argument `varDecomp` is a matrix of the variance-decomposition proportions. `sv` is a vector of singular values in descending order, and `condIdx` is a vector of condition indices in ascending order.

`X` — Input regression variables

numeric matrix | tabular array

Input regression variables, specified as a `numObs`-by-`numVars` numeric matrix or tabular array. Each column of `X` corresponds to a variable, and each row corresponds to an observation. For models with an intercept, `X` should contain a column of ones.

`collintest` scales the columns of `X` to unit length before processing. Data in `X` should not be centered.

If `X` is a tabular array, then the variables must be numeric.

**Data Types:** `double` | `table`
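The following is a minimal sketch of preparing an input matrix along these lines, using only base MATLAB functions. The data are made up for illustration, and the explicit scaling step only mirrors what the description says `collintest` does internally; it is not taken from the function's source.

```matlab
% Hypothetical predictors (not from Data_Canada): 6 observations, 2 variables.
x1 = [1; 2; 3; 4; 5; 6];
x2 = [2; 1; 4; 3; 6; 5];

% Include a column of ones for the intercept; do not center x1 or x2.
X = [ones(6,1), x1, x2];

% Scale each column to unit Euclidean length.
colNorms = sqrt(sum(X.^2, 1));
Xs = bsxfun(@rdivide, X, colNorms);

% Each column of Xs now has norm 1 (up to floating-point error).
disp(sqrt(sum(Xs.^2, 1)))
```

You would pass the unscaled `X` to `collintest`, because the description states that the function performs the scaling itself.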

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside single quotes (`' '`). You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

**Example:** `'plot','on','tolIdx',35` displays a results plot with a condition index tolerance of 35.

`'varNames'` — Variable names

cell vector of strings

Variable names used in displays and plots of the results, specified as the comma-separated pair consisting of `'varNames'` and a cell vector of strings. `varNames` must have length `numVars`, and each cell corresponds to a variable name. If an intercept term is present, then `varNames` must include the intercept term (e.g., include the name `'Const'`). The software truncates all variable names to the first five characters.

If `X` is a matrix, then the default value of `varNames` is the cell vector of strings `{'var1','var2',...}`.

If `X` is a tabular array, then the default value of `varNames` is the property `X.Properties.VariableNames`.

**Example:** `'varNames',{'Const','AGE','BBD'}`

**Data Types:** `cell`

`'display'` — Display results indicator

`'on'` (default) | `'off'`

Indicator for whether to display results in the Command Window, specified as the comma-separated pair consisting of `'display'` and one of `'on'` or `'off'`. If you specify the value `'on'`, then all outputs are displayed in tabular form.

**Example:** `'display','off'`

`'plot'` — Plot results indicator

`'off'` (default) | `'on'`

Indicator for whether to plot results, specified as the comma-separated pair consisting of `'plot'` and one of `'on'` or `'off'`.

If you specify the value `'on'`, then the plot shows the critical rows of the output `VarDecomp`, that is, those rows with condition indices above the input tolerance `tolIdx`.

If a group of at least two variables in a critical row has variance-decomposition proportions above the input tolerance `tolProp`, then the group is identified with red markers.

**Example:** `'plot','on'`

`'tolIdx'` — Condition index tolerance

`30` (default) | scalar value of at least 1

Condition index tolerance, specified as the comma-separated pair consisting of `'tolIdx'` and a scalar value of at least one. `collintest` uses the tolerance to decide which indices are large enough to infer a near dependency in the data. The `tolIdx` value is only used when `plot` has the value `'on'`.

**Example:** `'tolIdx',25`

`'tolProp'` — Variance-decomposition proportion tolerance

`0.5` (default) | scalar between 0 and 1

Variance-decomposition proportion tolerance, specified as the comma-separated pair consisting of `'tolProp'` and a scalar value between zero and one. `collintest` uses the tolerance to decide which variables are involved in any near dependency. The `tolProp` value is only used when `plot` has the value `'on'`.

**Example:** `'tolProp',0.4`
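To make the roles of the two tolerances concrete, here is a hedged sketch that applies the rule described for the plot (rows with condition index above `tolIdx`, then groups of at least two variables with proportions above `tolProp`) to the numbers shown in the example display earlier on this page. The loop is an illustration written for this page, not code from `collintest`, and the local variable names are chosen here for readability.

```matlab
% Values copied from the example display earlier on this page.
condIdx = [1; 4.5413; 13.5795; 17.9617; 87.8245];
varDecomp = [0.0012 0.0018 0.0003 0.0000 0.0001
             0.0261 0.0806 0.0035 0.0006 0.0012
             0.3386 0.3802 0.0811 0.0011 0.0137
             0.6138 0.5276 0.1918 0.0004 0.0193
             0.0202 0.0099 0.7233 0.9979 0.9658];
varNames = {'INF_C','INF_G','INT_S','INT_M','INT_L'};

tolIdx  = 30;    % default condition index tolerance
tolProp = 0.5;   % default proportion tolerance

% Critical rows: condition index above tolIdx.
criticalRows = find(condIdx > tolIdx);

for r = criticalRows'
    flagged = varDecomp(r,:) > tolProp;
    % A near dependency needs at least two participating variables.
    if sum(flagged) >= 2
        fprintf('Row %d (condIdx %.4f): %s\n', r, condIdx(r), ...
            strjoin(varNames(flagged), ', '));
    end
end
```

With the default tolerances, only row 5 is critical, and the flagged variables are `INT_S`, `INT_M`, and `INT_L`. Row 4 has two proportions above 0.5, but its condition index is below 30, so it is not reported.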

`sValue` — Singular values

vector in descending order

Singular values of scaled `X`, returned as a vector. The elements of `sValue` are in descending order.

`condIdx` — Condition indices

vector in ascending order

Condition indices, returned as a vector with elements in ascending order. All condition indices have value between one and the condition number of scaled `X`. Large indices identify near dependencies among the variables in `X`. The size of the indices is a measure of how near dependencies are to collinearity.

`VarDecomp` — Variance-decomposition proportions

matrix

Variance-decomposition proportions, returned as a `numVars`-by-`numVars` matrix. Large proportions, combined with a large condition index, identify groups of variables involved in near dependencies. The size of the proportions is a measure of how badly the regression is degraded by the dependency.

*Belsley collinearity diagnostics* assess the strength and sources of collinearity among variables in a multiple linear regression model.

To assess collinearity, the software computes singular values of the scaled variable matrix, *X*, and then converts them to condition indices. The condition indices identify the number and strength of any near dependencies between variables in the variable matrix. The software decomposes the variance of the ordinary least squares (OLS) estimates of the regression coefficients in terms of the singular values to identify variables involved in each near dependency, and the extent to which the dependencies degrade the regression.

The *condition indices* for
a scaled matrix *X* identify the number and strength
of any near dependencies in *X*.

For scaled matrix *X* with *p* columns
and singular values $${S}_{(1)}\ge {S}_{(2)}\ge \dots \ge {S}_{(p)}$$, the condition
indices for the columns of *X* are $${S}_{(1)}/{S}_{(j)},$$ *j* = 1,...,*p*.

All condition indices are bounded between one and the condition number.
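As an illustration of this definition, the sketch below computes condition indices directly from the singular values of a column-scaled matrix. The design matrix is hypothetical (its third column is built to be nearly twice the second), and the code uses base MATLAB functions rather than `collintest` itself.

```matlab
% Hypothetical design matrix with an intercept and a near dependency:
% the third column is almost twice the second.
t = (1:8)';
X = [ones(8,1), t, 2*t + 0.01*(-1).^t];

% Scale columns to unit length, as the diagnostics assume.
Xs = bsxfun(@rdivide, X, sqrt(sum(X.^2, 1)));

% svd returns singular values in descending order.
s = svd(Xs);

% Condition indices: largest singular value divided by each singular value.
condIdx = s(1) ./ s;

% Columns: singular values (descending), condition indices (ascending).
disp([s, condIdx])
```

The first condition index is always 1, and the last equals the condition number of the scaled matrix, which is why all indices lie between those two values.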

The *condition number* of
a scaled matrix *X* is an overall diagnostic for
detecting collinearity.

For scaled matrix *X* with *p* columns
and singular values $${S}_{(1)}\ge {S}_{(2)}\ge \dots \ge {S}_{(p)}$$, the condition
number is $${S}_{(1)}/{S}_{(p)}.$$

The condition number achieves its lower bound of one when the
columns of scaled *X* are orthonormal. The condition
number rises as variates exhibit greater dependency.

A limitation of the condition number as a diagnostic is that it fails to provide specifics on the strength and sources of any near dependencies.
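A small sketch of the two extremes described above, under made-up data: one matrix with orthonormal columns (obtained here from an economy-size QR factorization, which is just one convenient way to construct such columns) and one with two nearly identical columns.

```matlab
% Orthonormal columns: economy-size QR of a full-rank matrix gives Q'*Q = I.
A = [1 2; 3 4; 5 7; 2 1];
[Q, ~] = qr(A, 0);
sQ = svd(Q);
condOrth = sQ(1) / sQ(end);     % approximately 1

% Nearly dependent columns: the second is the first plus a small perturbation.
u = [1; 2; 3; 4];
B = [u, u + 1e-3*[1; -1; 1; -1]];
Bs = bsxfun(@rdivide, B, sqrt(sum(B.^2, 1)));   % unit-length columns
sB = svd(Bs);
condNear = sB(1) / sB(end);     % much larger than 1

fprintf('orthonormal: %.4f, near dependent: %.1f\n', condOrth, condNear);
```

The single number `condNear` signals that a near dependency exists, but it does not say which columns are involved; that is the gap the condition indices and variance-decomposition proportions are meant to fill.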

A *multiple linear regression model* is
a model of the form $$Y=X\beta +\epsilon .$$ *X* is
a design matrix of regression variables, and *β* is
a vector of regression coefficients.

The *singular values* of
a scaled matrix *X* are the diagonal elements of
the matrix *S* in the singular-value decomposition $$US{V}^{\prime}.$$

In descending order, the singular values of the scaled matrix *X* with *p* columns
are $${S}_{(1)}\ge {S}_{(2)}\ge \dots \ge {S}_{(p)}$$.

*Variance-decomposition proportions* identify
groups of variates involved in near dependencies, and the extent to
which the dependencies degrade the regression.

From the singular value decomposition $$US{V}^{\prime}$$ of scaled design matrix *X* (with *p* columns), let:

*V* be the matrix of orthonormal eigenvectors of $${X}^{\prime}X$$

$${S}_{(1)}\ge {S}_{(2)}\ge \dots \ge {S}_{(p)}$$ be the ordered diagonal elements of the matrix *S*

The variance of the OLS estimate of the *i*th multiple linear regression coefficient, $${\beta }_{i}$$, is proportional to the sum

$$V{(i,1)}^{2}/{S}_{(1)}^{2}+V{(i,2)}^{2}/{S}_{(2)}^{2}+\dots +V{(i,p)}^{2}/{S}_{(p)}^{2},$$

where $$V(i,j)$$ denotes the (*i*,*j*)th element of *V*.

The (*i*,*j*)th variance-decomposition
proportion is the proportion of the *j*th term in
the sum relative to the entire sum, *j* = 1,...,*p*.
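The definition can be sketched in base MATLAB as follows. The design matrix is hypothetical, and the final transpose is an assumption about layout: it arranges the result so that rows correspond to condition indices and columns to variables, which matches the example display earlier on this page, where each variable's column sums to approximately one.

```matlab
% Hypothetical design matrix: intercept plus two nearly proportional columns.
t = (1:8)';
X = [ones(8,1), t, 2*t + 0.01*(-1).^t];
Xs = bsxfun(@rdivide, X, sqrt(sum(X.^2, 1)));   % unit-length columns

% Economy-size SVD: Xs = U*S*V'.
[~, S, V] = svd(Xs, 0);
s = diag(S);                       % singular values, descending

% phi(i,j) = V(i,j)^2 / s(j)^2 is the j-th term of the sum for coefficient i.
phi = bsxfun(@rdivide, V.^2, (s.^2)');

% Divide each term by its row total to get proportions for coefficient i.
prop = bsxfun(@rdivide, phi, sum(phi, 2));

% Rows index singular values (condition indices); columns index variables.
varDecomp = prop';
disp(varDecomp)
```

In this construction the last row, which belongs to the smallest singular value, carries almost all of the variance for the second and third variables, reflecting the near dependency that was built into them.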

The terms $${S}_{(j)}^{2}$$ are the eigenvalues of scaled $${X}^{\prime}X$$. Thus, large variance-decomposition proportions correspond to small eigenvalues of $${X}^{\prime}X$$, a common diagnostic for collinearity. The singular-value decomposition provides a more direct, numerically stable view of the eigensystem of scaled $${X}^{\prime}X$$.

For purposes of collinearity diagnostics, Belsley [1] shows that column scaling of the design matrix, `X`, is always desirable. However, he also shows that centering the data in `X` is undesirable. For models with an intercept, if you center the data in `X`, then the role of the constant term in any near dependency is hidden, which yields misleading diagnostics.

Tolerances for identifying large condition indices and variance-decomposition proportions are comparable to critical values in standard hypothesis tests. Experience determines the most useful tolerance, but experiments suggest the `collintest` defaults are good starting points [1].
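The effect of centering can be seen in a small sketch with made-up data: a regressor with a large mean and little spread is nearly proportional to the constant column, and centering it removes that dependency from view. This illustrates the point attributed to Belsley [1] and is not a reproduction of his examples.

```matlab
% Hypothetical regressor with a large mean and small spread, so it is
% nearly proportional to the constant column.
x = 100 + [-1; 0.5; 1; -0.5; 0.2; -0.2];
n = numel(x);

% Uncentered design matrix with intercept, columns scaled to unit length.
Xu = [ones(n,1), x];
Xu = bsxfun(@rdivide, Xu, sqrt(sum(Xu.^2, 1)));
su = svd(Xu);
condUncentered = su(1) / su(end);

% Centered regressor: orthogonal to the constant column by construction.
Xc = [ones(n,1), x - mean(x)];
Xc = bsxfun(@rdivide, Xc, sqrt(sum(Xc.^2, 1)));
sc = svd(Xc);
condCentered = sc(1) / sc(end);

fprintf('uncentered: %.1f, centered: %.4f\n', condUncentered, condCentered);
```

The uncentered condition number is far above the default tolerance of 30, while the centered one is essentially 1, even though the underlying data and the difficulty of separating the intercept from the slope are unchanged.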

[1] Belsley, D. A., E. Kuh, and R. E. Welsch. *Regression Diagnostics*. New York, NY: John Wiley & Sons, Inc., 1980.

[2] Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T. C. Lee. *The Theory and Practice of Econometrics*. New York, NY: John Wiley & Sons, Inc., 1985.
