Multiple linear regression
Estimate Multiple Linear Regression Coefficients
Load the carsmall data set. Identify weight and horsepower as predictors and mileage as the response.
load carsmall
x1 = Weight;
x2 = Horsepower;    % Contains NaN data
y = MPG;
Compute the regression coefficients for a linear model with an interaction term.
X = [ones(size(x1)) x1 x2 x1.*x2];
b = regress(y,X)    % Removes NaN data

b = 4×1

   60.7104
   -0.0102
   -0.1882
    0.0000
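The comments in the code above note that the predictors contain NaN values and that regress removes them. The following is a minimal sketch of what that implies, assuming row-wise deletion of every observation that has a missing predictor or response. The names ok, bManual, and relDiff are illustrative, and the backslash solve is a stand-in for the least-squares fit, not a statement about how regress is implemented.

```matlab
load carsmall
x1 = Weight;
x2 = Horsepower;
y  = MPG;
X  = [ones(size(x1)) x1 x2 x1.*x2];

% Keep only observations with no missing value in X or y.
ok = all(~isnan([X y]),2);

% Ordinary least squares on the complete rows.
bManual = X(ok,:)\y(ok);

% Fit on the full data and compare the two coefficient vectors.
b = regress(y,X);
relDiff = norm(b - bManual)/norm(b)

% The interaction coefficient displays as 0.0000 in the short format.
% Print it with more digits to see its actual magnitude.
fprintf('%.4e\n',b(4))
```

The relative difference should be at the level of floating-point round-off, and the printed interaction coefficient is small but not exactly zero.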
Plot the data and the model.
scatter3(x1,x2,y,'filled')
hold on
x1fit = min(x1):100:max(x1);
x2fit = min(x2):10:max(x2);
[X1FIT,X2FIT] = meshgrid(x1fit,x2fit);
YFIT = b(1) + b(2)*X1FIT + b(3)*X2FIT + b(4)*X1FIT.*X2FIT;
mesh(X1FIT,X2FIT,YFIT)
xlabel('Weight')
ylabel('Horsepower')
zlabel('MPG')
view(50,10)
hold off
Diagnose Outliers Using Residuals
Load the examgrades data set.
load examgrades
Use the last exam scores as response data and the first two exam scores as predictor data.
y = grades(:,5);
X = [ones(size(grades(:,1))) grades(:,1:2)];
Perform multiple linear regression with alpha = 0.01.
[~,~,r,rint] = regress(y,X,0.01);
Diagnose outliers by finding the residual intervals rint that do not contain 0.
contain0 = (rint(:,1)<0 & rint(:,2)>0);
idx = find(contain0==false)

idx = 2×1

    53
    54
Observations 53 and 54 are possible outliers.
Create a scatter plot of the residuals. Fill in the points corresponding to the outliers.
hold on
scatter(y,r)
scatter(y(idx),r(idx),'b','filled')
xlabel("Last Exam Grades")
ylabel("Residuals")
hold off
Determine Significance of Linear Regression Relationship
Load the hald data set. Use heat as the response variable and ingredients as the predictor data.
load hald
y = heat;
X1 = ingredients;
x1 = ones(size(X1,1),1);
X = [x1 X1];    % Includes column of ones
Perform multiple linear regression and generate model statistics.
[~,~,~,~,stats] = regress(y,X)
stats = 1×4

    0.9824  111.4792    0.0000    5.9830
Because the R² value of 0.9824 is close to 1, and the p-value, which displays as 0.0000, is less than the default significance level of 0.05, a significant linear regression relationship exists between the response y and the predictor variables in X.
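The p-value shown above is rounded by the default display format. As a small sketch, the following prints the four entries of stats with explicit formats so that the p-value is visible in scientific notation. The variable name pIsSignificant is illustrative.

```matlab
load hald
X = [ones(size(ingredients,1),1) ingredients];   % column of ones included
[~,~,~,~,stats] = regress(heat,X);

% stats = [R^2, F-statistic, p-value, error variance estimate]
fprintf('R^2 = %.4f\nF   = %.4f\np   = %.3e\ns^2 = %.4f\n',stats)

% Compare the p-value with the default significance level.
pIsSignificant = stats(3) < 0.05
```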
y — Response data
Response data, specified as an n-by-1 numeric vector. Rows of y correspond to different observations. y must have the same number of rows as X.
X — Predictor data
Predictor data, specified as an n-by-p numeric matrix. Rows of X correspond to observations, and columns correspond to predictor variables. X must have the same number of rows as y.
alpha — Significance level
0.05 (default) | positive scalar
Significance level, specified as a positive scalar.
alpha must be between 0 and 1.
b — Coefficient estimates for multiple linear regression
Coefficient estimates for multiple linear regression, returned as a numeric vector. b is a p-by-1 vector, where p is the number of predictors in X. If the columns of X are linearly dependent, regress sets the maximum number of elements of b to zero.
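The behavior for linearly dependent columns can be illustrated with a small sketch on synthetic data. The data, the seed, and the names numZero and rankX are assumptions made for this illustration. The fourth column of X is built as the sum of the second and third, so the design matrix has rank 3.

```matlab
rng(1)                          % reproducible synthetic data (assumption)
n  = 40;
x1 = randn(n,1);
x2 = randn(n,1);
y  = 3 + 2*x1 - x2 + 0.1*randn(n,1);

% The fourth column is an exact linear combination of columns 2 and 3.
X = [ones(n,1) x1 x2 x1+x2];

b = regress(y,X)                % expect a rank-deficiency warning
numZero = nnz(b == 0)           % number of coefficients set to zero
rankX   = rank(X)               % numerical rank of the design matrix
```

With one redundant column, one coefficient is expected to be exactly zero, and the remaining coefficients describe the fit in terms of the other columns.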
bint — Lower and upper confidence bounds for coefficient estimates
Lower and upper confidence bounds for coefficient estimates, returned as a numeric matrix. bint is a p-by-2 matrix, where p is the number of predictors in X. The first column of bint contains lower confidence bounds for each of the coefficient estimates; the second column contains upper confidence bounds. If the columns of X are linearly dependent, regress returns zeros in elements of bint corresponding to the zero elements of b.
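As a brief sketch of how alpha affects bint, the following compares the interval widths at the default level with those at alpha = 0.01. The synthetic data, the seed, and the names width95, width99, wider, and centered are assumptions made for this illustration.

```matlab
rng(2)                              % reproducible synthetic data (assumption)
n = 50;
X = [ones(n,1) randn(n,2)];         % design matrix with a column of ones
y = X*[1; 0.5; -2] + randn(n,1);

[b,bint95] = regress(y,X);          % default alpha = 0.05
[~,bint99] = regress(y,X,0.01);     % alpha = 0.01

% Interval width is the upper bound minus the lower bound.
width95 = bint95(:,2) - bint95(:,1);
width99 = bint99(:,2) - bint99(:,1);
wider   = all(width99 > width95)    % smaller alpha gives wider intervals

% Each interval is centered on its coefficient estimate.
centered = max(abs(mean(bint95,2) - b)) < 1e-10
```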
r — Residuals
Residuals, returned as a numeric vector. r is an n-by-1 vector, where n is the number of observations, or rows, in X.
rint — Intervals to diagnose outliers
Intervals to diagnose outliers, returned as a numeric matrix. rint is an n-by-2 matrix, where n is the number of observations, or rows, in X. If the interval rint(i,:) for observation i does not contain zero, the corresponding residual is larger than expected in 100*(1-alpha)% of new observations, suggesting an outlier. For more information, see Algorithms.
stats — Model statistics
Model statistics, returned as a numeric vector including the R² statistic, the F-statistic and its p-value, and an estimate of the error variance.
X must include a column of ones so that the model contains a constant term. The F-statistic and its p-value are computed under this assumption and are not correct for models without a constant.
The F-statistic is the test statistic of the F-test on the regression model. The F-test looks for a significant linear regression relationship between the response variable and the predictor variables.
The R² statistic can be negative for models without a constant, indicating that the model is not appropriate for the data.
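The four entries of stats can be reproduced from the residuals with the standard textbook formulas for a model that contains a constant term. The following is a sketch under that assumption. The names SSE, SST, SSR, manual, and the two difference variables are illustrative, and the formulas are the usual definitions rather than a description of the internal code of regress.

```matlab
load hald
X = [ones(size(ingredients,1),1) ingredients];   % column of ones included
y = heat;
[~,~,r,~,stats] = regress(y,X);

[n,p] = size(X);
SSE = sum(r.^2);                    % residual sum of squares
SST = sum((y - mean(y)).^2);        % total sum of squares about the mean
SSR = SST - SSE;                    % regression sum of squares

R2   = 1 - SSE/SST;                 % coefficient of determination
s2   = SSE/(n - p);                 % error variance estimate
F    = (SSR/(p - 1))/s2;            % F-statistic for the overall model
pval = fcdf(F,p - 1,n - p,'upper'); % upper-tail p-value of the F-test

manual = [R2 F pval s2];
relErr = max(abs(manual([1 2 4]) - stats([1 2 4]))./stats([1 2 4]))
pDiff  = abs(pval - stats(3))
```

Both differences should be at round-off level. Without the column of ones, the identity SST = SSR + SSE no longer holds in this form, which is why the F-statistic and its p-value are not meaningful in that case.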
Algorithms
In a linear model, observed values of y and their residuals are random variables. Residuals have normal distributions with zero mean but with different variances at different values of the predictors. To put residuals on a comparable scale, regress “Studentizes” the residuals. That is, regress divides the residuals by an estimate of their standard deviation that is independent of their value. Studentized residuals have t-distributions with known degrees of freedom. The intervals returned in rint are shifts of the 100*(1-alpha)% confidence intervals of these t-distributions, centered at the residuals.
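The idea of dividing each residual by a standard deviation estimate that does not depend on that residual can be sketched with externally Studentized residuals, where the variance for observation i is estimated with observation i left out. This is an illustration of the technique on synthetic data, not a reproduction of the exact rint computation. The data, the planted outlier, and the names h, s2i, t, tcrit, and flagged are assumptions.

```matlab
rng(0)                                  % reproducible synthetic data (assumption)
n = 30;
X = [ones(n,1) randn(n,2)];             % design matrix with a column of ones
y = X*[1; 2; -1] + 0.5*randn(n,1);
y(7) = y(7) + 5;                        % plant one gross outlier at observation 7

[Q,~] = qr(X,0);                        % economy-size QR factorization
h = sum(Q.^2,2);                        % leverages (diagonal of the hat matrix)
b = X\y;
r = y - X*b;                            % raw residuals
[n,p] = size(X);
nu = n - p;                             % residual degrees of freedom
s2 = sum(r.^2)/nu;                      % mean squared error

% Leave-one-out variance estimate: does not depend on the i-th residual.
s2i = (nu*s2 - r.^2./(1 - h))/(nu - 1);
t = r./sqrt(s2i.*(1 - h));              % externally Studentized residuals

alpha = 0.01;
tcrit = tinv(1 - alpha/2,nu - 1);       % two-sided critical value
flagged = find(abs(t) > tcrit)          % candidate outliers
```

An interval of the form r(i) ± tcrit*sqrt(s2i(i)*(1 - h(i))) excludes zero exactly when abs(t(i)) exceeds tcrit, which is how an interval test and a Studentized-residual test relate.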
regress is useful when you simply need the output arguments of the function and when you want to repeat fitting a model multiple times in a loop. If you need to investigate a fitted regression model further, create a linear regression model object LinearModel by using fitlm. A LinearModel object provides more features than regress.
Use the properties of LinearModel to investigate a fitted linear regression model. The object properties include information about coefficient estimates, summary statistics, fitting method, and input data.
Use the object functions of LinearModel to predict responses and to modify, evaluate, and visualize the linear regression model.
Unlike regress, the fitlm function does not require a column of ones in the input data. A model created by fitlm always includes an intercept term unless you specify not to include it by using the 'Intercept' name-value pair argument.
You can find the information in the output of regress using the properties and object functions of LinearModel.
Output of regress | Equivalent Values in LinearModel
b | See the Estimate column of the Coefficients property.
bint | Use the coefCI function.
r | See the Raw column of the Residuals property.
rint | Not supported. Instead, use studentized residuals (Residuals property) and observation diagnostics (Diagnostics property) to find outliers.
stats | See the model display in the Command Window. You can find the statistics in the model properties (MSE and Rsquared) and by using the anova function.
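The correspondence between the outputs of regress and a LinearModel object can be checked with a short sketch on the hald data. The difference variables dB, dBint, dR, dR2, and dMSE are illustrative names. Note that regress is given the column of ones explicitly, while fitlm receives only the predictors.

```matlab
load hald
y = heat;
X = [ones(size(ingredients,1),1) ingredients];   % regress needs the ones
[b,bint,r,~,stats] = regress(y,X);

mdl = fitlm(ingredients,y);                      % fitlm adds the intercept

dB    = max(abs(b - mdl.Coefficients.Estimate))  % coefficient estimates
dBint = max(max(abs(bint - coefCI(mdl))))        % 95% confidence bounds
dR    = max(abs(r - mdl.Residuals.Raw))          % raw residuals
dR2   = abs(stats(1) - mdl.Rsquared.Ordinary)    % R-squared
dMSE  = abs(stats(4) - mdl.MSE)                  % error variance estimate
```

All five differences should be close to zero. Calling anova(mdl,'summary') displays the F-statistic and p-value for the overall model.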
 Chatterjee, S., and A. S. Hadi. “Influential Observations, High Leverage Points, and Outliers in Linear Regression.” Statistical Science. Vol. 1, 1986, pp. 379–416.
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Introduced before R2006a