Multiple linear regression
b = regress(y,X)
[b,bint] = regress(y,X)
[b,bint,r] = regress(y,X)
[b,bint,r,rint] = regress(y,X)
[b,bint,r,rint,stats] = regress(y,X)
[___] = regress(y,X,alpha)
Load the carsmall data set. Identify weight and horsepower as predictors and mileage as the response.
load carsmall
x1 = Weight;
x2 = Horsepower;    % Contains NaN data
y = MPG;
Compute the regression coefficients for a linear model with an interaction term.
X = [ones(size(x1)) x1 x2 x1.*x2];
b = regress(y,X)    % Removes NaN data
b = 4×1

   60.7104
   -0.0102
   -0.1882
    0.0000
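The comment in the code above notes that regress removes observations that contain NaN. As a rough cross-check, the same coefficients can be recovered by dropping those rows manually and solving the least-squares problem with the backslash operator. This is only a sketch of an equivalent calculation, not a description of how regress is implemented; the names keep, bCheck, and maxRelDiff are introduced here for illustration.

```matlab
load carsmall
x1 = Weight; x2 = Horsepower; y = MPG;
X = [ones(size(x1)) x1 x2 x1.*x2];

% Keep only observations with no NaN in the response or the predictors.
keep = ~any(isnan([y X]), 2);

% Ordinary least squares on the complete rows via backslash.
bCheck = X(keep,:) \ y(keep);

% Compare with regress, which omits the NaN rows itself.
b = regress(y,X);
maxRelDiff = max(abs(b - bCheck) ./ abs(b))
```

A small value of maxRelDiff indicates that the two calculations agree up to rounding.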
Plot the data and the model.
scatter3(x1,x2,y,'filled')
hold on
x1fit = min(x1):100:max(x1);
x2fit = min(x2):10:max(x2);
[X1FIT,X2FIT] = meshgrid(x1fit,x2fit);
YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT;
mesh(X1FIT,X2FIT,YFIT)
xlabel('Weight')
ylabel('Horsepower')
zlabel('MPG')
view(50,10)
hold off
Load the examgrades data set.
load examgrades
Use the last exam scores as response data and the first two exam scores as predictor data.
y = grades(:,5);
X = [ones(size(grades(:,1))) grades(:,1:2)];
Perform multiple linear regression with alpha = 0.01.
[~,~,r,rint] = regress(y,X,0.01);
Diagnose outliers by finding the residual intervals rint that do not contain 0.
contain0 = (rint(:,1)<0 & rint(:,2)>0);
idx = find(contain0==false)
idx = 2×1

    53
    54
Observations 53 and 54 are possible outliers.
Create a scatter plot of the residuals. Fill in the points corresponding to the outliers.
hold on
scatter(y,r)
scatter(y(idx),r(idx),'b','filled')
xlabel("Last Exam Grades")
ylabel("Residuals")
hold off
Load the hald data set. Use heat as the response variable and ingredients as the predictor data.
load hald
y = heat;
X1 = ingredients;
x1 = ones(size(X1,1),1);
X = [x1 X1];    % Includes column of ones
Perform multiple linear regression and generate model statistics.
[~,~,~,~,stats] = regress(y,X)
stats = 1×4

    0.9824  111.4792    0.0000    5.9830
Because the R² value of 0.9824 is close to 1, and the p-value of 0.0000 is less than the default significance level of 0.05, a significant linear regression relationship exists between the response y and the predictor variables in X.
y— Response data
Response data, specified as an n-by-1 numeric vector. Rows of y correspond to different observations. y must have the same number of rows as X.
X— Predictor data
Predictor data, specified as an n-by-p numeric matrix. Rows of X correspond to observations, and columns correspond to predictor variables. X must have the same number of rows as y.
alpha— Significance level
0.05 (default) | positive scalar
Significance level, specified as a positive scalar. alpha must be between 0 and 1.
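To illustrate the effect of alpha, the following sketch fits the same synthetic data at two significance levels and compares the widths of the coefficient intervals. The data, the random seed, and the variable names are made up for illustration.

```matlab
rng(0)                               % reproducible synthetic data
n = 30;
x = linspace(0,10,n)';
y = 2 + 0.5*x + randn(n,1);
X = [ones(n,1) x];

[~,bint95] = regress(y,X);           % default alpha = 0.05, 95% intervals
[~,bint99] = regress(y,X,0.01);      % alpha = 0.01, 99% intervals

width95 = diff(bint95,1,2);          % upper bound minus lower bound
width99 = diff(bint99,1,2);
disp([width95 width99])              % second column (99%) is wider
```

A smaller alpha corresponds to a higher confidence level, so the intervals widen.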
b— Coefficient estimates for multiple linear regression
Coefficient estimates for multiple linear regression, returned as a numeric vector. b is a p-by-1 vector, where p is the number of predictors in X. If the columns of X are linearly dependent, regress sets the maximum number of elements of b to zero.
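The behavior for linearly dependent columns can be seen with a small synthetic example. In this sketch the third column of X is an exact multiple of the second, so at least one coefficient is expected to be returned as zero. The data and variable names are hypothetical, and which element is set to zero is left to regress.

```matlab
rng(1)                               % reproducible synthetic data
n = 20;
x = (1:n)';
y = 1 + 2*x + 0.1*randn(n,1);

% The third column is an exact multiple of the second, so X has rank 2.
X = [ones(n,1) x 2*x];

b = regress(y,X);                    % regress warns that X is rank deficient
numZero = nnz(b == 0)                % count of coefficients set to zero
```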
bint— Lower and upper confidence bounds for coefficient estimates
Lower and upper confidence bounds for coefficient estimates, returned as a numeric matrix. bint is a p-by-2 matrix, where p is the number of predictors in X. The first column of bint contains lower confidence bounds for each of the coefficient estimates; the second column contains upper confidence bounds. If the columns of X are linearly dependent, regress returns zeros in elements of bint corresponding to the zero elements of b.
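For a full-rank X, the bounds in bint can be reproduced with the standard textbook formula: the coefficient estimate plus or minus a t critical value times its standard error. The sketch below assumes that formula and uses synthetic data; the names se, t, and bintManual are illustrative, and the page itself does not state how regress computes the bounds internally.

```matlab
rng(2)                              % reproducible synthetic data
n = 25;
x = linspace(-1,1,n)';
y = 1 - 3*x + 0.2*randn(n,1);
X = [ones(n,1) x];
alpha = 0.05;

[b,bint] = regress(y,X,alpha);

p  = size(X,2);
r  = y - X*b;                       % residuals
s2 = (r'*r)/(n - p);                % error variance estimate
se = sqrt(s2 * diag(inv(X'*X)));    % standard errors of the coefficients
t  = tinv(1 - alpha/2, n - p);      % two-sided t critical value

bintManual = [b - t*se, b + t*se];
maxDiff = max(abs(bint(:) - bintManual(:)))
```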
r— Residuals
Residuals, returned as a numeric vector. r is an n-by-1 vector, where n is the number of observations, or rows, in X.
rint— Intervals to diagnose outliers
Intervals to diagnose outliers, returned as a numeric matrix. rint is an n-by-2 matrix, where n is the number of observations, or rows, in X. If the interval rint(i,:) for observation i does not contain zero, the corresponding residual is larger than expected in 100*(1-alpha)% of new observations, suggesting an outlier. For more information, see Algorithms.
stats— Model statistics
Model statistics, returned as a numeric vector including the R² statistic, the F-statistic and its p-value, and an estimate of the error variance.
X must include a column of ones so that the model contains a constant term. The F-statistic and its p-value are computed under this assumption and are not correct for models without a constant.
The F-statistic is the test statistic of the F-test on the regression model. The F-test looks for a significant linear regression relationship between the response variable and the predictor variables.
The R² statistic can be negative for models without a constant, indicating that the model is not appropriate for the data.
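The four entries of stats can be cross-checked with the usual definitions for a model that includes a constant term. The following sketch uses the hald data from the example above and assumes the standard formulas for R², the overall F-test, and the error variance; the variable names are illustrative, and the formulas are not taken from the regress source.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];
[n,p] = size(X);

[b,~,r,~,stats] = regress(y,X);

SSE = sum(r.^2);                       % residual sum of squares
SST = sum((y - mean(y)).^2);           % total sum of squares about the mean
R2  = 1 - SSE/SST;
s2  = SSE/(n - p);                     % error variance estimate
F   = ((SST - SSE)/(p - 1))/s2;        % F-statistic against the constant-only model
pF  = fcdf(F, p - 1, n - p, 'upper');  % upper-tail p-value of the F-test

statsManual = [R2 F pF s2]
```

Because R² is computed as 1 - SSE/SST, it drops below zero whenever the fitted model has a larger residual sum of squares than the mean of y alone, which can happen only when X has no constant column.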
In a linear model, observed values of y and their residuals are random variables. Residuals have normal distributions with zero mean but with different variances at different values of the predictors. To put residuals on a comparable scale, regress "Studentizes" the residuals. That is, regress divides the residuals by an estimate of their standard deviation that is independent of their value. Studentized residuals have t-distributions with known degrees of freedom. The intervals returned in rint are shifts of the 100*(1-alpha)% confidence intervals of these t-distributions, centered at the residuals.
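One way to make this description concrete is the usual externally Studentized construction, in which the standard deviation estimate for observation i leaves out the i-th residual. The sketch below assumes that construction, with n - p - 1 degrees of freedom for the t-distribution, and compares the result with rint on synthetic data. The formulas and the names h, s2i, ser, and rintManual are assumptions made for illustration; the page does not spell out the exact computation.

```matlab
rng(3)                                % reproducible synthetic data
n = 30;
x = linspace(0,5,n)';
y = 4 + 1.5*x + randn(n,1);
X = [ones(n,1) x];
alpha = 0.05;

[b,~,r,rint] = regress(y,X,alpha);

p   = size(X,2);
nu  = n - p;                          % residual degrees of freedom
[Q,~] = qr(X,0);
h   = sum(Q.^2,2);                    % leverages (diagonal of the hat matrix)
SSE = sum(r.^2);

% Leave-one-out variance estimate: excludes the contribution of residual i.
s2i = (SSE - r.^2./(1 - h))/(nu - 1);
ser = sqrt(s2i .* (1 - h));           % estimated standard deviation of each residual
t   = tinv(1 - alpha/2, nu - 1);      % two-sided t critical value

rintManual = [r - t*ser, r + t*ser];  % t interval shifted to be centered at r
maxDiff = max(abs(rint(:) - rintManual(:)))
```

If maxDiff is not close to zero in a given release, the assumed construction differs from what regress does internally.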
 Chatterjee, S., and A. S. Hadi. “Influential Observations, High Leverage Points, and Outliers in Linear Regression.” Statistical Science. Vol. 1, 1986, pp. 379–416.