Ridge regression

`b = ridge(y,X,k)`

`b = ridge(y,X,k,scaled)`

`b = ridge(y,X,k)` returns a vector `b` of coefficient estimates for a multilinear
ridge regression of the responses in `y` on the predictors in `X`. `X` is an
*n*-by-*p* matrix of *p* predictors at each of *n* observations. `y` is
an *n*-by-1 vector of observed responses. `k` is a vector of ridge parameters.
If `k` has *m* elements, `b` is *p*-by-*m*.
By default, `b` is computed after centering and scaling
the predictors to have mean 0 and standard deviation 1. The model
does not include a constant term, and `X` should not contain a column of 1s.
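
As a minimal sketch of the default call, the following uses synthetic data. The data, the random seed, and the ridge parameter values are illustrative assumptions and do not come from this page. `ridge` requires Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                 % fix the seed so the run is repeatable
n = 50; p = 3;                          % n observations, p predictors
X = randn(n,p);                         % predictor matrix, no column of 1s
y = X*[1; -2; 0.5] + 0.1*randn(n,1);    % synthetic n-by-1 responses

k = [0 0.1 1 10];                       % m = 4 ridge parameters
b = ridge(y,X,k);                       % coefficients on the standardized scale

size(b)                                 % p-by-m, here 3-by-4
```

Each column of `b` holds the coefficient estimates for the corresponding element of `k`.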

`b = ridge(y,X,k,scaled)` uses the {`0`,`1`}-valued flag `scaled`
to determine whether the coefficient estimates in `b`
are restored to the scale of the original data. `ridge(y,X,k,0)`
performs this additional transformation. In this case, `b` contains *p*+1
coefficients for each value of `k`, with the first
row corresponding to a constant term in the model. `ridge(y,X,k,1)` is
the same as `ridge(y,X,k)`. In this case, `b` contains *p* coefficients,
without a coefficient for a constant term.
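
The effect of the `scaled` flag on the shape of the output can be checked with a short sketch. The synthetic data and the values in `k` below are assumptions chosen only for illustration.

```matlab
rng(1);                                 % repeatable synthetic data
X = randn(40,2);                        % p = 2 predictors
y = 3 + X*[2; -1] + 0.2*randn(40,1);    % responses with a nonzero offset
k = [0.5 5];                            % m = 2 ridge parameters

b1 = ridge(y,X,k,1);                    % standardized scale, no constant term
b0 = ridge(y,X,k,0);                    % original scale, constant term in row 1

size(b1)                                % p-by-m, here 2-by-2
size(b0)                                % (p+1)-by-m, here 3-by-2
isequal(b1, ridge(y,X,k))               % scaled = 1 is the default behavior
```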

The relationship between `b0 = ridge(y,X,k,0)` and `b1 = ridge(y,X,k,1)` is given by

```
m = mean(X);
s = std(X,0,1)';
b1_scaled = b1./s;
b0 = [mean(y)-m*b1_scaled; b1_scaled]
```

This can be seen by replacing the *x*_{i} (*i* = 1, ..., *p*) in the multilinear model *y* = *b*_{0}^{0} + *b*_{1}^{0}*x*_{1} +
... + *b*_{p}^{0}*x*_{p} with the *z*-scores *z*_{i} =
(*x*_{i} – *μ*_{i})/*σ*_{i}, and replacing *y* with *y* – *μ*_{y}.
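
The substitution can be written out step by step. The superscript 1 for the entries of `b1` is a notational assumption made here for illustration, and the sums run over the *p* predictors. The standardized model, which has no constant term, is

$$y-{\mu}_{y}=\sum_{i=1}^{p}{b}_{i}^{1}\,\frac{{x}_{i}-{\mu}_{i}}{{\sigma}_{i}}$$

Moving *μ*_{y} to the right side and collecting the terms that do not depend on the *x*_{i} gives

$$y=\left({\mu}_{y}-\sum_{i=1}^{p}\frac{{b}_{i}^{1}}{{\sigma}_{i}}{\mu}_{i}\right)+\sum_{i=1}^{p}\frac{{b}_{i}^{1}}{{\sigma}_{i}}{x}_{i}$$

Matching coefficients with the model on the original scale yields

$${b}_{i}^{0}=\frac{{b}_{i}^{1}}{{\sigma}_{i}},\qquad {b}_{0}^{0}={\mu}_{y}-\sum_{i=1}^{p}{\mu}_{i}\frac{{b}_{i}^{1}}{{\sigma}_{i}}$$

which corresponds to `b1_scaled` and `mean(y)-m*b1_scaled` in the code.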

In general, `b1` is more useful for producing
plots in which the coefficients are to be displayed on the same scale,
such as a *ridge trace* (a plot of the regression coefficients as a function
of the ridge parameter). `b0` is more useful for making predictions.
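
Both uses can be sketched together. The synthetic data, the grid of ridge parameters, the choice `k = 5` for prediction, and the new observations in `Xnew` are all illustrative assumptions.

```matlab
rng(2);                                 % repeatable synthetic data
n  = 60;
x1 = randn(n,1);
x2 = x1 + 0.05*randn(n,1);              % nearly collinear with x1
X  = [x1 x2 randn(n,1)];
y  = 1 + X*[1; 1; -1] + 0.3*randn(n,1);

% Ridge trace: standardized coefficients as a function of k
k  = 0:0.5:20;
b1 = ridge(y,X,k,1);                    % 3-by-numel(k)
plot(k,b1','LineWidth',1.5)             % one curve per predictor
xlabel('Ridge parameter k')
ylabel('Standardized coefficient')
legend('x1','x2','x3')

% Prediction: coefficients on the original scale for a single k
b0   = ridge(y,X,5,0);                  % [constant term; slopes]
Xnew = [0.5 0.4 -1; -1 -1.2 0.3];       % two hypothetical new observations
yhat = b0(1) + Xnew*b0(2:end);          % predicted responses, 2-by-1
```

Because `b0` includes the constant term in its first row, predictions need no separate centering or scaling of `Xnew`.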

Coefficient estimates for multiple linear regression models
rely on the independence of the model terms. When terms are correlated
and the columns of the design matrix *X* have an
approximate linear dependence, the matrix *X*^{T}*X* becomes close to singular,
so its inverse is poorly conditioned. As a result, the least-squares
estimate

$$\widehat{\beta}={({X}^{T}X)}^{-1}{X}^{T}y$$

becomes highly sensitive to random errors in the observed response *y*,
producing a large variance. This situation of *multicollinearity* can
arise, for example, when data are collected without an experimental
design.
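
The sensitivity described above can be seen in a small experiment that uses only base MATLAB. The data, the noise levels, and the true coefficients are assumptions made for illustration.

```matlab
rng(3);                                 % repeatable synthetic data
n  = 100;
x1 = randn(n,1);
x2 = x1 + 1e-3*randn(n,1);              % almost an exact copy of x1
X  = [x1 x2];

condXtX = cond(X'*X)                    % very large: X'*X is nearly singular

% Same underlying model, two independent draws of the random error
beta = [1; 1];
yA = X*beta + 0.1*randn(n,1);
yB = X*beta + 0.1*randn(n,1);

bA = X\yA;                              % least-squares estimates, draw A
bB = X\yB;                              % least-squares estimates, draw B
[bA bB]                                 % individual coefficients can differ widely
[sum(bA) sum(bB)]                       % their sum stays close to 2
```

Only the sum of the two coefficients is well determined by the data. The split between them is driven almost entirely by the noise.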

*Ridge regression* addresses the problem
by estimating regression coefficients using

$$\widehat{\beta}={({X}^{T}X+kI)}^{-1}{X}^{T}y$$

where *k* is the *ridge parameter* and *I* is
the identity matrix. Small positive values of *k* improve
the conditioning of the problem and reduce the variance of the estimates.
Although ridge estimates are biased, their reduced variance often results
in a smaller mean squared error than that of least-squares estimates.
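
The formula can be evaluated directly in base MATLAB. This is a sketch that assumes the formula is applied to centered and scaled predictors, as the description of the default scaling suggests. It is not the toolbox implementation. The data and the value of `k` are illustrative, and the backslash operator is used in place of an explicit inverse.

```matlab
rng(4);                                 % repeatable synthetic data
n  = 80; p = 3;
x1 = randn(n,1);
X  = [x1, x1 + 0.01*randn(n,1), randn(n,1)];   % first two columns nearly collinear
y  = X*[1; 1; -1] + 0.2*randn(n,1);

% Standardize the predictors and center the response
Z  = (X - mean(X))./std(X);             % implicit expansion, R2016b or later
yc = y - mean(y);

k = 1;                                  % illustrative ridge parameter
A = Z'*Z;
bLS    = A \ (Z'*yc);                   % least-squares estimate (k = 0)
bRidge = (A + k*eye(p)) \ (Z'*yc);      % ridge estimate for k > 0

condLS    = cond(A)                     % large for nearly collinear columns
condRidge = cond(A + k*eye(p))          % smaller: adding k*I improves conditioning
norm(bRidge) < norm(bLS)                % ridge shrinks the coefficient vector
```

Adding `k*eye(p)` raises every eigenvalue of `Z'*Z` by `k`, which is why the condition number drops and the coefficient vector shrinks for any positive `k`.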

