
regress

Multiple linear regression

Syntax

b = regress(y,X)
[b,bint] = regress(y,X)
[b,bint,r] = regress(y,X)
[b,bint,r,rint] = regress(y,X)
[b,bint,r,rint,stats] = regress(y,X)
[___] = regress(y,X,alpha)

Description

b = regress(y,X) returns a vector b of coefficient estimates for a multiple linear regression of the responses in vector y on the predictors in matrix X. The matrix X must include a column of ones.

[b,bint] = regress(y,X) also returns a matrix bint of 95% confidence intervals for the coefficient estimates.

[b,bint,r] = regress(y,X) also returns a vector r of residuals.

[b,bint,r,rint] = regress(y,X) also returns a matrix rint of intervals that can be used to diagnose outliers.

[b,bint,r,rint,stats] = regress(y,X) also returns a vector stats that contains the R2 statistic, the F-statistic and its p-value, and an estimate of the error variance.

[___] = regress(y,X,alpha) uses a 100*(1-alpha)% confidence level to compute bint and rint. Specify any of the output argument combinations in the previous syntaxes.
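As a minimal sketch with synthetic data (the variables here are illustrative, not part of this page), passing alpha = 0.10 produces 90% confidence intervals:

```matlab
rng default                          % for reproducibility
X = [ones(20,1) randn(20,2)];        % design matrix with a column of ones
y = X*[2; 1; -1] + randn(20,1);      % synthetic response

% 90% confidence intervals for coefficients and residuals (alpha = 0.10)
[b,bint,r,rint] = regress(y,X,0.10);
```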

Examples

Load the carsmall data set. Identify weight and horsepower as predictors and mileage as the response.

load carsmall
x1 = Weight;
x2 = Horsepower;    % Contains NaN data
y = MPG;

Compute the regression coefficients for a linear model with an interaction term.

X = [ones(size(x1)) x1 x2 x1.*x2];
b = regress(y,X)    % Removes NaN data
b = 4×1

   60.7104
   -0.0102
   -0.1882
    0.0000

Plot the data and the model.

scatter3(x1,x2,y,'filled')
hold on
x1fit = min(x1):100:max(x1);
x2fit = min(x2):10:max(x2);
[X1FIT,X2FIT] = meshgrid(x1fit,x2fit);
YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT;
mesh(X1FIT,X2FIT,YFIT)
xlabel('Weight')
ylabel('Horsepower')
zlabel('MPG')
view(50,10)
hold off

Load the examgrades data set.

load examgrades

Use the last exam scores as response data and the first two exam scores as predictor data.

y = grades(:,5);
X = [ones(size(grades(:,1))) grades(:,1:2)];

Perform multiple linear regression with alpha = 0.01.

[~,~,r,rint] = regress(y,X,0.01);

Diagnose outliers by finding the residual intervals rint that do not contain 0.

contain0 = (rint(:,1)<0 & rint(:,2)>0);
idx = find(contain0==false)
idx = 2×1

    53
    54

Observations 53 and 54 are possible outliers.

Create a scatter plot of the residuals. Fill in the points corresponding to the outliers.

hold on
scatter(y,r)
scatter(y(idx),r(idx),'b','filled')
xlabel("Last Exam Grades")
ylabel("Residuals")
hold off

Load the hald data set. Use heat as the response variable and ingredients as the predictor data.

load hald
y = heat;
X1 = ingredients;
x1 = ones(size(X1,1),1);
X = [x1 X1];    % Includes column of ones

Perform multiple linear regression and generate model statistics.

[~,~,~,~,stats] = regress(y,X)
stats = 1×4

    0.9824  111.4792    0.0000    5.9830

Because the R2 value of 0.9824 is close to 1, and the p-value of 0.0000 is less than the default significance level of 0.05, a significant linear regression relationship exists between the response y and the predictor variables in X.

Input Arguments

y — Response data

Response data, specified as an n-by-1 numeric vector. Rows of y correspond to different observations. y must have the same number of rows as X.

Data Types: single | double

X — Predictor data

Predictor data, specified as an n-by-p numeric matrix. Rows of X correspond to observations, and columns correspond to predictor variables. X must have the same number of rows as y.

Data Types: single | double

alpha — Significance level

Significance level of the confidence intervals, specified as a scalar between 0 and 1. For example, alpha = 0.05 corresponds to 95% confidence intervals.

Data Types: single | double

Output Arguments

b — Coefficient estimates

Coefficient estimates for multiple linear regression, returned as a numeric vector. b is a p-by-1 vector, where p is the number of predictors in X. If the columns of X are linearly dependent, regress obtains a basic solution by setting the maximum possible number of elements of b to zero.

Data Types: double

bint — Confidence bounds for coefficient estimates

Lower and upper confidence bounds for coefficient estimates, returned as a numeric matrix. bint is a p-by-2 matrix, where p is the number of predictors in X. The first column of bint contains lower confidence bounds for each of the coefficient estimates; the second column contains upper confidence bounds. If the columns of X are linearly dependent, regress returns zeros in elements of bint corresponding to the zero elements of b.

Data Types: double

r — Residuals

Residuals, returned as a numeric vector. r is an n-by-1 vector, where n is the number of observations, with one residual for each row of y and X.

Data Types: single | double

rint — Intervals to diagnose outliers

Intervals to diagnose outliers, returned as a numeric matrix. rint is an n-by-2 matrix, where n is the number of observations. If the interval rint(i,:) for observation i does not contain zero, the corresponding residual is larger than expected in 100*(1-alpha)% of new observations, suggesting an outlier. For more information, see Algorithms.

Data Types: single | double

stats — Model statistics

Model statistics, returned as a four-element row vector containing, in order, the R2 statistic, the F-statistic, the p-value of the F-statistic, and an estimate of the error variance.

  • X must include a column of ones so that the model contains a constant term. The F-statistic and its p-value are computed under this assumption and are not correct for models without a constant.

  • The F-statistic is the test statistic of the F-test on the regression model. The F-test looks for a significant linear regression relationship between the response variable and the predictor variables.

  • The R2 statistic can be negative for models without a constant, indicating that the model is not appropriate for the data.

Data Types: single | double

Tips

  • regress treats NaN values in X or y as missing values. regress omits observations with missing values from the regression fit.
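For example, in this minimal synthetic sketch the observation with a NaN response is excluded from the fit:

```matlab
y = [1; 2; NaN; 4];
X = [ones(4,1) (1:4)'];
b = regress(y,X);    % coefficients estimated from the three complete rows only
```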

Algorithms

Residual Intervals

In a linear model, observed values of y and their residuals are random variables. Residuals have normal distributions with zero mean but with different variances at different values of the predictors. To put residuals on a comparable scale, regress “Studentizes” the residuals. That is, regress divides the residuals by an estimate of their standard deviation that is independent of their value. Studentized residuals have t-distributions with known degrees of freedom. The intervals returned in rint are shifts of the 100*(1-alpha)% confidence intervals of these t-distributions, centered at the residuals.
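The following sketch shows one standard way to construct such intervals from the leverages (diagonal of the hat matrix) and leave-one-out error variances. It illustrates the idea; it is not necessarily the exact computation that regress performs internally.

```matlab
% Assumes y (n-by-1) and X (n-by-p, including a column of ones) are in scope.
alpha = 0.05;
[n,p] = size(X);
b = X\y;                                    % least-squares coefficients
r = y - X*b;                                % raw residuals
h = diag(X*((X'*X)\X'));                    % leverages (diagonal of hat matrix)
s2 = (sum(r.^2) - r.^2./(1-h))/(n-p-1);     % leave-one-out error variances
se = sqrt(s2.*(1-h));                       % estimated std. dev. of each residual
tval = tinv(1-alpha/2, n-p-1);              % t critical value
rint = [r - tval*se, r + tval*se];          % intervals centered at the residuals
```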

References

[1] Chatterjee, S., and A. S. Hadi. “Influential Observations, High Leverage Points, and Outliers in Linear Regression.” Statistical Science. Vol. 1, 1986, pp. 379–416.

Introduced before R2006a