Partial least-squares (PLS) regression is a technique used with data that contain correlated predictor variables. This technique constructs new predictor variables, known as components, as linear combinations of the original predictor variables. PLS constructs these components while considering the observed response values, leading to a parsimonious model with reliable predictive power.
The technique is something of a cross between multiple linear regression and principal component analysis:
Multiple linear regression finds a combination of the predictors that best fits a response.
Principal component analysis finds combinations of the predictors with large variance, reducing correlations. The technique makes no use of response values.
PLS finds combinations of the predictors that have a large covariance with the response values.
PLS therefore combines information about the variances of both the predictors and the responses, while also considering the correlations among them.
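To make the contrast concrete, the following sketch computes one direction per technique on synthetic data. The data, the random seed, and all variable names are assumptions made for illustration; they are not part of the toolbox example. For a single response, the unit-length combination of centered predictors with the largest covariance with the response is proportional to X'y, which is the direction of the first PLS component.

```matlab
% Synthetic data: three predictors, two of them strongly correlated (assumption)
rng(0);
n = 50;
X = randn(n,3);
X(:,2) = X(:,1) + 0.1*randn(n,1);
y = X*[1; 0; 2] + 0.5*randn(n,1);

% Center predictors and response
Xc = bsxfun(@minus, X, mean(X,1));
yc = y - mean(y);

% Multiple linear regression: least-squares coefficients
bMLR = Xc \ yc;

% PCA: unit direction of largest predictor variance (ignores y)
[~,~,V] = svd(Xc,'econ');
wPCA = V(:,1);

% PLS, first component: unit direction of largest covariance with y
wPLS = Xc'*yc;
wPLS = wPLS / norm(wPLS);

% Sample covariance, variance, and correlation of a projection Xc*w
covWithY  = @(w) (Xc*w)'*yc / (n-1);
varOfProj = @(w) sum((Xc*w).^2) / (n-1);
corrWithY = @(w) covWithY(w) / sqrt(varOfProj(w)*var(yc));

fprintf('|cov with y|:  PCA %.3f, PLS %.3f\n', abs(covWithY(wPCA)), covWithY(wPLS));
fprintf('variance:      PCA %.3f, PLS %.3f\n', varOfProj(wPCA), varOfProj(wPLS));
fprintf('|corr with y|: MLR %.3f, PLS %.3f\n', abs(corrWithY(bMLR)), abs(corrWithY(wPLS)));
```

In this sketch the PCA direction has the largest variance, the least-squares direction has the largest correlation with y, and the PLS direction has the largest covariance with y, which depends on both variance and correlation.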
PLS shares characteristics with other regression and feature transformation techniques. It is similar to ridge regression in that it is used in situations with correlated predictors. It is similar to stepwise regression (or more general feature selection techniques) in that it can be used to select a smaller set of model terms. PLS differs from these methods, however, by transforming the original predictor space into the new component space.
The Statistics Toolbox™ function plsregress carries out PLS regression.
For example, consider the data on biochemical oxygen demand in moore.mat, padded with noisy versions of the predictors to introduce correlations:
load moore
y = moore(:,6);              % Response
X0 = moore(:,1:5);           % Original predictors
X1 = X0+10*randn(size(X0));  % Correlated predictors
X = [X0,X1];
Use plsregress to perform PLS regression with the same number of components as predictors, then plot the percentage variance explained in the response as a function of the number of components:
[XL,yl,XS,YS,beta,PCTVAR] = plsregress(X,y,10);
plot(1:10,cumsum(100*PCTVAR(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in y');
Choosing the number of components in a PLS model is a critical step. The plot gives a rough indication, showing nearly 80% of the variance in y explained by the first component, with as many as five additional components making significant contributions.
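One simple heuristic for reading such a plot programmatically is to take the smallest number of components whose cumulative explained variance in y reaches a chosen cutoff. The sketch below rebuilds the example data and applies a hypothetical 90% cutoff; the cutoff and the variable names cumPctY, threshold, and ncomp are assumptions for illustration, not a toolbox recommendation. Because the padding is random, the result can change from run to run.

```matlab
% Rebuild the example data (random padding, so results vary between runs)
load moore
y  = moore(:,6);
X0 = moore(:,1:5);
X  = [X0, X0 + 10*randn(size(X0))];

[~,~,~,~,~,PCTVAR] = plsregress(X,y,10);
cumPctY = cumsum(100*PCTVAR(2,:));   % cumulative percent variance explained in y

threshold = 90;                      % hypothetical cutoff, in percent
ncomp = find(cumPctY >= threshold, 1);
if isempty(ncomp)
    ncomp = numel(cumPctY);          % cutoff never reached: keep all components
end
fprintf('Smallest model reaching %d%%: %d component(s)\n', threshold, ncomp);
```

Explained variance on the fitting data never decreases as components are added, so a rule like this tends to favor larger models than an error estimate based on held-out data would.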
The following computes the six-component model:
[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,6);
yfit = [ones(size(X,1),1) X]*beta;
plot(y,yfit,'o')
The scatter shows a reasonable correlation between fitted and observed responses, and this is confirmed by the R² statistic:
TSS = sum((y-mean(y)).^2);
RSS = sum((y-yfit).^2);
Rsquared = 1 - RSS/TSS
Rsquared =
    0.8421
A plot of the weights of the ten predictors in each of the six components shows that two of the components (the last two computed) explain the majority of the variance in X:
plot(1:10,stats.W,'o-');
legend({'c1','c2','c3','c4','c5','c6'},'Location','NW')
xlabel('Predictor');
ylabel('Weight');
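The plsregress function reference describes stats.W as the weights that map the centered predictors to the predictor scores, so that XS equals the centered X multiplied by W. The following minimal check rebuilds the example data and compares the two; the names Xc, XSfromW, and relErr are illustrative, and the check assumes the weights behave as that reference describes.

```matlab
% Rebuild the example data (random padding, so values vary between runs)
load moore
y  = moore(:,6);
X0 = moore(:,1:5);
X  = [X0, X0 + 10*randn(size(X0))];

[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,6);

Xc = bsxfun(@minus, X, mean(X,1));   % center each predictor column
XSfromW = Xc*stats.W;                % scores rebuilt from the weights

relErr = norm(XS - XSfromW,'fro') / norm(XS,'fro');
fprintf('Relative difference between XS and Xc*W: %.2e\n', relErr);
```

Each column of stats.W therefore gives the coefficients of one component as a linear combination of the ten centered predictors, which is what the weight plot displays.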
A plot of the mean-squared errors suggests that as few as two components may provide an adequate model:
[ax,h1,h2] = plotyy(0:6,MSE(1,:),0:6,MSE(2,:));  % ax avoids shadowing the built-in axes function
set(h1,'Marker','o')
set(h2,'Marker','o')
legend('MSE Predictors','MSE Response')
xlabel('Number of Components')
The calculation of mean-squared errors by plsregress is controlled by optional parameter name/value pairs specifying cross-validation type and the number of Monte Carlo repetitions.
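As a hedged illustration, the sketch below requests 5-fold cross-validation repeated 10 times through the 'cv' and 'mcreps' parameters, then reports the number of components with the lowest estimated response error. The fold count, the repetition count, and the names MSEcv and bestNcomp are arbitrary choices made for this example; consult the plsregress reference for the accepted parameter values in your release.

```matlab
% Rebuild the example data (random padding, so results vary between runs)
load moore
y  = moore(:,6);
X0 = moore(:,1:5);
X  = [X0, X0 + 10*randn(size(X0))];

% 5-fold cross-validation, repeated 10 times (both values are arbitrary choices)
[~,~,~,~,~,~,MSEcv] = plsregress(X,y,6,'cv',5,'mcreps',10);

% Row 1: predictor MSE, row 2: response MSE, for 0,1,...,6 components
[minMSE,idx] = min(MSEcv(2,:));
bestNcomp = idx - 1;                 % column 1 corresponds to zero components
fprintf('Lowest cross-validated response MSE: %.3g at %d component(s)\n', ...
    minMSE, bestNcomp);
```

With only 20 observations in the example data, the cross-validated estimates are noisy, so the selected number of components can differ between runs.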