# ridge

Ridge regression

## Description

B = ridge(y,X,k) returns coefficient estimates for ridge regression models of the predictor data X and the response y. Each column of B corresponds to a particular ridge parameter k. By default, the function computes B after centering and scaling the predictors to have mean 0 and standard deviation 1. Because the model does not include a constant term, do not add a column of 1s to X.

B = ridge(y,X,k,scaled) specifies the scaling for the coefficient estimates in B. When scaled is 1 (default), ridge does not restore the coefficients to the original data scale. When scaled is 0, ridge restores the coefficients to the scale of the original data. For more information, see Coefficient Scaling.

## Examples

### Ridge Regression

Perform ridge regression for a range of ridge parameters and observe how the coefficient estimates change.

Load the acetylene data set.

load acetylene

The acetylene data set contains observations for the predictor variables x1, x2, and x3, and the response variable y.

Plot the predictor variables against each other. Observe any correlation between the variables.

plotmatrix([x1 x2 x3])

For example, note the linear correlation between x1 and x3.

Compute coefficient estimates for a multilinear model with interaction terms, for a range of ridge parameters. Use x2fx to create interaction terms and ridge to perform ridge regression.

X = [x1 x2 x3];
D = x2fx(X,'interaction');
D(:,1) = []; % No constant term
k = 0:1e-5:5e-3;
B = ridge(y,D,k);

Plot the ridge trace.

figure
plot(k,B,'LineWidth',2)
ylim([-100 100])
grid on
xlabel('Ridge Parameter')
ylabel('Standardized Coefficient')
title('Ridge Trace')
legend('x1','x2','x3','x1x2','x1x3','x2x3')

The estimates stabilize toward the right side of the plot. Note that the coefficient of the x2x3 interaction term changes sign at a ridge parameter value of $\approx 5 \times 10^{-4}$.
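The sign change can also be located numerically rather than read off the plot. The following is a minimal sketch, assuming the acetylene sample data set is available and that row 6 of B holds the x2x3 term (the same order as the legend); the variable names signFlip and kFlip are illustrative only.

```matlab
% Assumes the acetylene sample data set that ships with the toolbox.
load acetylene
D = x2fx([x1 x2 x3],'interaction');
D(:,1) = [];                  % drop the constant column
k = 0:1e-5:5e-3;
B = ridge(y,D,k);

% Row 6 is assumed to hold the x2x3 term, matching the legend order.
signFlip = find(diff(sign(B(6,:))) ~= 0, 1);  % first index where the sign changes
kFlip = k(signFlip + 1);                      % first ridge parameter after the change
disp(kFlip)
```

If no sign change occurs in the chosen range of k, signFlip and kFlip are empty.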

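### Predict Values Using Ridge Regression

Predict miles per gallon (MPG) values using ridge regression.

Load the carbig data set, and create the predictor matrix X and the response vector y.

load carbig
X = [Acceleration Weight Displacement Horsepower];
y = MPG;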

Split the data into training and test sets.

n = length(y);
rng('default') % For reproducibility
c = cvpartition(n,'HoldOut',0.3);
idxTrain = training(c,1);
idxTest = ~idxTrain;

Find the coefficients of a ridge regression model (with k = 5).

k = 5;
b = ridge(y(idxTrain),X(idxTrain,:),k,0);

Predict MPG values for the test data using the model.

yhat = b(1) + X(idxTest,:)*b(2:end);

Compare the predicted values to the actual miles per gallon (MPG) values using a reference line.

scatter(y(idxTest),yhat)
hold on
plot(y(idxTest),y(idxTest))
xlabel('Actual MPG')
ylabel('Predicted MPG')
hold off
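To complement the visual comparison, the test error can be summarized as a single number. The following is a minimal sketch that repeats the steps of this example and then computes the root mean squared error on the test set; the names residuals and rmse are illustrative, and skipping NaN residuals is an assumption about how you want to treat test rows with missing values.

```matlab
load carbig
X = [Acceleration Weight Displacement Horsepower];
y = MPG;

rng('default')                        % same split as the example
c = cvpartition(length(y),'HoldOut',0.3);
idxTrain = training(c,1);
idxTest = ~idxTrain;

b = ridge(y(idxTrain),X(idxTrain,:),5,0);
yhat = b(1) + X(idxTest,:)*b(2:end);

% Test rows with NaN in X or y give NaN residuals, so skip them in the mean.
residuals = y(idxTest) - yhat;
rmse = sqrt(mean(residuals.^2,'omitnan'));
disp(rmse)
```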

## Input Arguments

### y

Response data, specified as an n-by-1 numeric vector, where n is the number of observations.

Data Types: single | double

### X

Predictor data, specified as an n-by-p numeric matrix. The rows of X correspond to the n observations, and the columns of X correspond to the p predictors.

Data Types: single | double

### k

Ridge parameters, specified as a numeric vector.

Example: [0.2 0.3 0.4 0.5]

Data Types: single | double

### scaled

Scaling flag that determines whether the coefficient estimates in B are restored to the scale of the original data, specified as either 0 or 1. If scaled is 0, then ridge performs this additional transformation. In this case, B contains p+1 coefficients for each value of k, with the first row of B corresponding to a constant term in the model. If scaled is 1, then the software omits the additional transformation, and B contains p coefficients without a constant term coefficient.

## Output Arguments

### B

Coefficient estimates, returned as a numeric matrix. The rows of B correspond to the predictors in X, and the columns of B correspond to the ridge parameters k.

If scaled is 1, then B is a p-by-m matrix, where m is the number of elements in k. If scaled is 0, then B is a (p+1)-by-m matrix.
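The two output shapes can be checked on a small synthetic data set. This is an illustrative sketch only; the data and the names Bscaled and Borig are made up for the example.

```matlab
rng(0)                               % reproducible synthetic data
n = 50; p = 3;
X = randn(n,p);
y = X*[1; -2; 0.5] + 0.1*randn(n,1);
k = [0.1 1 10];                      % m = 3 ridge parameters

Bscaled = ridge(y,X,k,1);            % standardized coefficients, no constant term
Borig   = ridge(y,X,k,0);            % original scale, constant term in row 1

disp(size(Bscaled))                  % p-by-m
disp(size(Borig))                    % (p+1)-by-m
```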

## More About

### Ridge Regression

Ridge regression is a method for estimating coefficients of linear models that include linearly correlated predictors.

Coefficient estimates for multiple linear regression models rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the matrix ${\left({X}^{T}X\right)}^{-1}$ is close to singular. Therefore, the least-squares estimate

$\hat{\beta }={\left({X}^{T}X\right)}^{-1}{X}^{T}y$

is highly sensitive to random errors in the observed response y, producing a large variance. This situation of multicollinearity can arise, for example, when you collect data without an experimental design.

Ridge regression addresses the problem of multicollinearity by estimating regression coefficients using

$\hat{\beta }={\left({X}^{T}X+kI\right)}^{-1}{X}^{T}y$

where k is the ridge parameter and I is the identity matrix. Small, positive values of k improve the conditioning of the problem and reduce the variance of the estimates. Although ridge estimates are biased, their reduced variance often results in a smaller mean squared error than that of least-squares estimates.
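The ridge formula can be evaluated directly on a small synthetic problem. The sketch below applies it to centered and scaled predictors, as described for the default behavior of ridge, and compares the result with the function output. That the two agree up to floating-point error is an assumption about how ridge applies the penalty, not something this page states; the data and the names Z, bManual, and bRidge are illustrative. The subtraction and division rely on implicit expansion (R2016b or later).

```matlab
rng(0)
n = 100; p = 3;
X = randn(n,p);
X(:,3) = X(:,1) + 0.01*randn(n,1);     % columns 1 and 3 are nearly collinear
y = X*[1; 2; -1] + randn(n,1);
k = 0.5;

Z = (X - mean(X)) ./ std(X,0,1);       % centered and scaled predictors
bManual = (Z'*Z + k*eye(p)) \ (Z'*y);  % (Z'Z + kI)^(-1) Z'y without forming the inverse
bRidge = ridge(y,X,k);                 % default scaled = 1

disp(max(abs(bManual - bRidge)))       % expected to be near machine precision
```

Using the backslash operator instead of inv avoids forming the inverse explicitly, which is generally the more numerically stable choice.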

### Coefficient Scaling

The scaling of the coefficient estimates for the ridge regression models depends on the value of the scaled input argument.

Suppose the ridge parameter k is equal to 0. The coefficients returned by ridge, when scaled is equal to 1, are estimates of the $b_i^1$ in the multilinear model

$y - \mu_y = b_1^1 z_1 + \dots + b_p^1 z_p + \varepsilon$

where $z_i = (x_i - \mu_i)/\sigma_i$ are the centered and scaled predictors, $y - \mu_y$ is the centered response, and $\varepsilon$ is an error term. The superscript indicates the value of scaled, not a power. You can rewrite the model as

$y = b_0^0 + b_1^0 x_1 + \dots + b_p^0 x_p + \varepsilon$

with ${b}_{0}^{0}={\mu }_{y}-\sum _{i=1}^{p}\frac{{b}_{i}^{1}{\mu }_{i}}{{\sigma }_{i}}$ and ${b}_{i}^{0}=\frac{{b}_{i}^{1}}{{\sigma }_{i}}$. The $b_i^0$ terms correspond to the coefficients returned by ridge when scaled is equal to 0.

More generally, for any value of k, if B1 = ridge(y,X,k,1), then

m = mean(X);
s = std(X,0,1)';
B1_scaled = B1./s;
B0 = [mean(y)-m*B1_scaled; B1_scaled]

where B0 = ridge(y,X,k,0).
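The relationship can be verified numerically. The following sketch uses made-up synthetic data without missing values and compares the manually rescaled coefficients (named B0_manual here for illustration) with the output of ridge when scaled is 0.

```matlab
rng(1)
n = 60; p = 4;
X = 5 + 2*randn(n,p);                  % predictors with nonzero means
y = 3 + X*[1; -1; 0.5; 2] + randn(n,1);
k = [0 0.5 5];

B1 = ridge(y,X,k,1);                   % standardized coefficients
m = mean(X);
s = std(X,0,1)';
B1_scaled = B1./s;                     % undo the predictor scaling
B0_manual = [mean(y) - m*B1_scaled; B1_scaled];   % add the constant term row

B0 = ridge(y,X,k,0);
disp(max(abs(B0_manual(:) - B0(:))))   % expected to be near machine precision
```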

## Tips

• ridge treats NaN values in X or y as missing values. ridge omits observations with missing values from the ridge regression fit.

• In general, set scaled equal to 1 to produce plots where the coefficients are displayed on the same scale. See Ridge Regression for an example using a ridge trace plot, where the regression coefficients are displayed as a function of the ridge parameter. When making predictions, set scaled equal to 0. For an example, see Predict Values Using Ridge Regression.
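The handling of missing values described in the first tip can be illustrated by comparing a fit on data that contains NaN values with a fit on the same data after removing the affected rows by hand. This is a sketch on made-up synthetic data; the names keep, bAll, and bClean are illustrative.

```matlab
rng(2)
X = randn(30,2);
y = X*[1; -1] + 0.1*randn(30,1);
X(4,1) = NaN;                          % missing predictor value
y(9) = NaN;                            % missing response value

keep = ~(isnan(y) | any(isnan(X),2));  % rows with no missing values
bAll = ridge(y,X,1,0);                 % ridge omits the incomplete rows itself
bClean = ridge(y(keep),X(keep,:),1,0); % same fit with the rows removed by hand

disp(max(abs(bAll - bClean)))
```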

## Alternative Functionality

• Ridge, lasso, and elastic net regularization are all methods for estimating the coefficients of a linear model while penalizing large coefficients. The type of penalty depends on the method (see More About for more details). To perform lasso or elastic net regularization, use lasso instead.

• If you have high-dimensional full or sparse predictor data, you can use fitrlinear instead of ridge. When using fitrlinear, specify the 'Regularization','ridge' name-value pair argument. Set the value of the 'Lambda' name-value pair argument to a vector of the ridge parameters of your choice. fitrlinear returns a trained linear model Mdl. You can access the coefficient estimates stored in the Beta property of the model by using Mdl.Beta.
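A minimal sketch of the fitrlinear alternative follows, using made-up synthetic data. It also sets 'Learner' to 'leastsquares', because the default learner of fitrlinear is a support vector machine rather than least squares. The Lambda values here are arbitrary, and they may not be directly comparable to the ridge parameter k, since fitrlinear does not center and scale the predictors the way ridge does.

```matlab
rng(3)
n = 200; p = 5;
X = randn(n,p);
y = X*(1:p)' + randn(n,1);
lambda = [1e-3 1e-2 1e-1];             % arbitrary regularization strengths

Mdl = fitrlinear(X,y,'Learner','leastsquares', ...
    'Regularization','ridge','Lambda',lambda);

disp(size(Mdl.Beta))                   % one column of coefficients per Lambda value
disp(Mdl.Bias)                         % intercept for each Lambda value
```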
