Analysis of variance for linear regression model
Component ANOVA Table
Create a component ANOVA table from a linear regression model of the hospital data set.
Load the hospital data set and create a model of blood pressure as a function of age and gender.
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')
mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE         tStat       pValue  
                   _________    ________    ________    _________
    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772

Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125
Create an ANOVA table of the model.
tbl = anova(mdl)
tbl=4×5 table
             SumSq     DF    MeanSq       F        pValue 
             ______    __    ______    _______    ________
    Age      18.705     1    18.705    0.40055     0.52831
    Sex      222.09     1    222.09     4.7558    0.031643
    Age^2    30.934     1    30.934    0.66242     0.41772
    Error    4483.1    96    46.699                       
The table displays the following columns for each term except the constant (intercept) term:
SumSq: Sum of squares explained by the term.
DF: Degrees of freedom. In this example, DF is 1 for each term in the model and n – p for the error term, where n is the number of observations and p is the number of coefficients (including the intercept) in the model. For example, the DF for the error term in this model is 100 – 4 = 96. If any variable in the model is a categorical variable, the DF for that variable is the number of indicator variables created for its categories (number of categories – 1).
MeanSq: Mean square, defined by MeanSq = SumSq/DF. For example, the mean square of the error term, mean squared error (MSE), is 4.4831e+03/96 = 46.6991.
F: F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by F = MeanSq/MSE, where MSE is the mean squared error. When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the DF value for the corresponding term, and the denominator degrees of freedom is n – p. In this example, each F-statistic follows an $F_{(1,96)}$ distribution.
pValue: p-value of the F-statistic value. For example, the p-value for Age is 0.5283, implying that Age is not significant at the 5% significance level given the other terms in the model.
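As a rough check on these definitions, the following sketch rebuilds the MeanSq, F, and pValue columns from the SumSq and DF values shown in the table display. The variable names (T, MSE, dfErr, and so on) are illustrative only. The upper-tail probability is written with the base betainc function, on the assumption that it matches what fcdf(F,df1,dfErr,'upper') would return. Because the inputs are rounded display values, agreement is limited to about four significant digits.

```matlab
% Values copied from the component ANOVA table display (rounded).
T = table([18.705; 222.09; 30.934; 4483.1], [1; 1; 1; 96], ...
    'VariableNames', {'SumSq','DF'}, ...
    'RowNames', {'Age','Sex','Age^2','Error'});

% Mean square: MeanSq = SumSq/DF for every row.
T.MeanSq = T.SumSq ./ T.DF;

% The mean square of the Error row is the mean squared error (MSE).
MSE   = T{'Error','MeanSq'};
dfErr = T{'Error','DF'};

% F = MeanSq/MSE for the model terms (all rows except Error).
terms = {'Age','Sex','Age^2'};
F   = T{terms,'MeanSq'} ./ MSE;
df1 = T{terms,'DF'};

% Upper-tail probability of the F(df1,dfErr) distribution, expressed
% through the regularized incomplete beta function.
p = betainc(dfErr ./ (dfErr + df1 .* F), dfErr/2, df1/2);

disp(table(F, p, 'RowNames', terms))
```

For the single-degree-of-freedom terms Sex and Age^2, F equals the square of the tStat value in the model display (for example, 2.1808^2 is approximately 4.756). The same identity does not hold for Age, presumably because the default hierarchical sum of squares type ('h') evaluates Age without the higher-order Age^2 term.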
Summary ANOVA Table
Create a summary ANOVA table from a linear regression model of the hospital data set.
Load the hospital data set and create a model of blood pressure as a function of age and gender.
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')
mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE         tStat       pValue  
                   _________    ________    ________    _________
    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772

Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125
Create a summary ANOVA table of the model.
tbl = anova(mdl,'summary')
tbl=7×5 table
                     SumSq     DF    MeanSq       F        pValue 
                     ______    __    ______    _______    ________
    Total            4757.8    99    48.059                       
    Model            274.73     3    91.577      1.961     0.12501
    . Linear          243.8     2     121.9     2.6103    0.078726
    . Nonlinear      30.934     1    30.934    0.66242     0.41772
    Residual         4483.1    96    46.699                       
    . Lack of fit    1483.1    39    38.028    0.72253     0.85732
    . Pure error       3000    57    52.632                       
The table displays tests for groups of terms: Total, Model, and Residual.
Total: This row shows the total sum of squares (SumSq), degrees of freedom (DF), and the mean square (MeanSq). Note that MeanSq = SumSq/DF.
Model: This row includes SumSq, DF, MeanSq, F-statistic value (F), and p-value (pValue). Because this model includes a nonlinear term (Age^2), anova partitions the sum of squares (SumSq) of Model into two parts: SumSq explained by the linear terms (Age and Sex) and SumSq explained by the nonlinear term (Age^2). The corresponding F-statistic values are for testing the significance of the linear terms and the nonlinear term as separate groups. The nonlinear group consists of the Age^2 term only, so it has the same p-value as the Age^2 term in the Component ANOVA Table.
Residual: This row includes SumSq, DF, and MeanSq; the F and pValue entries appear in the Lack of fit row. Because the data set includes replications, anova partitions the residual SumSq into the part for the replications (Pure error) and the rest (Lack of fit). To test the lack of fit, anova computes the F-statistic value by comparing the model residuals to the model-free variance estimate computed on the replications. The F-statistic value (0.72253, with p-value 0.85732) shows no evidence of lack of fit.
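The partitions described for the summary table can be checked directly from the displayed numbers. The sketch below uses the rounded values from the table display, so the sums agree only up to display precision. The variable names are illustrative and are not produced by anova.

```matlab
% Sums of squares copied from the summary ANOVA table display (rounded).
ssTotal = 4757.8;  ssModel  = 274.73;  ssResid = 4483.1;
ssLin   = 243.8;   ssNonlin = 30.934;
ssLOF   = 1483.1;  ssPure   = 3000;

% Each group splits into its sub-rows (up to rounding in the display).
gapTotal = ssTotal - (ssModel + ssResid);
gapModel = ssModel - (ssLin + ssNonlin);
gapResid = ssResid - (ssLOF + ssPure);

% The Model and Linear F-statistics use the residual mean square.
mse     = ssResid/96;
fModel  = (ssModel/3)/mse;
fLinear = (ssLin/2)/mse;

% The lack-of-fit F-statistic uses the pure-error mean square instead.
fLOF = (ssLOF/39)/(ssPure/57);

% The ratio of model to total sum of squares reproduces R-squared.
rsq = ssModel/ssTotal;

fprintf('F(model) = %.4f, F(linear) = %.4f, F(lack of fit) = %.4f, R^2 = %.4f\n', ...
    fModel, fLinear, fLOF, rsq);
```

The recomputed values should line up with the F column of the table and with the R-squared value in the model display.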
Fit a linear regression model that contains a categorical predictor. Reorder the categories of the categorical predictor to control the reference level in the model. Then, use anova to test the significance of the categorical variable.
Model with Categorical Predictor
Load the carsmall data set and create a linear regression model of MPG as a function of Model_Year. To treat the numeric vector Model_Year as a categorical variable, identify the predictor using the 'CategoricalVars' name-value pair argument.
load carsmall
mdl = fitlm(Model_Year,MPG,'CategoricalVars',1,'VarNames',{'Model_Year','MPG'})
mdl = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE      tStat       pValue  
                     ________    ______    ______    __________
    (Intercept)        17.69     1.0328    17.127    3.2371e-30
    Model_Year_76     3.8839     1.4059    2.7625     0.0069402
    Model_Year_82      14.02     1.4369    9.7571    8.2164e-16

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15
The model formula in the display, MPG ~ 1 + Model_Year, corresponds to
$\mathrm{MPG}={\beta}_{0}+{\beta}_{1}{{\rm I}}_{\mathrm{Year}=76}+{\beta}_{2}{{\rm I}}_{\mathrm{Year}=82}+\epsilon$,
where ${{\rm I}}_{\mathrm{Year}=76}$ and ${{\rm I}}_{\mathrm{Year}=82}$ are indicator variables whose value is one if the value of Model_Year is 76 and 82, respectively. The Model_Year variable includes three distinct values, which you can check by using the unique function.
unique(Model_Year)
ans = 3×1
70
76
82
fitlm chooses the smallest value in Model_Year as a reference level ('70') and creates two indicator variables ${{\rm I}}_{\mathrm{Year}=76}$ and ${{\rm I}}_{\mathrm{Year}=82}$. The model includes only two indicator variables because the design matrix becomes rank deficient if the model includes three indicator variables (one for each level) and an intercept term.
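The rank deficiency mentioned here can be seen on a small design matrix. The following sketch uses a hypothetical six-element model-year vector (not the carsmall data) and builds the indicator columns by hand; the names years, D, Xfull, and Xref are illustrative.

```matlab
% Hypothetical model-year vector with the three levels used in the example.
years = [70; 70; 76; 76; 82; 82];

% One indicator column per level (implicit expansion, R2016b or later).
D = double(years == [70 76 82]);

% Intercept plus all three indicators: the indicator columns sum to the
% intercept column, so the four columns are linearly dependent.
Xfull = [ones(numel(years),1) D];

% Intercept plus indicators for 76 and 82 only (70 is the reference level).
Xref = [ones(numel(years),1) D(:,2:3)];

fprintf('rank(Xfull) = %d of %d columns\n', rank(Xfull), size(Xfull,2));
fprintf('rank(Xref)  = %d of %d columns\n', rank(Xref),  size(Xref,2));
```

Dropping one indicator column restores full column rank, which is why a reference level is needed whenever the model also contains an intercept.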
Model with Full Indicator Variables
You can interpret the model formula of mdl as a model that has three indicator variables without an intercept term:
$\mathrm{MPG}={\beta}_{0}{{\rm I}}_{\mathrm{Year}=70}+\left({\beta}_{0}+{\beta}_{1}\right){{\rm I}}_{\mathrm{Year}=76}+\left({\beta}_{0}+{\beta}_{2}\right){{\rm I}}_{\mathrm{Year}=82}+\epsilon$.
Alternatively, you can create a model that has three indicator variables without an intercept term by manually creating indicator variables and specifying the model formula.
temp_Year = dummyvar(categorical(Model_Year));
Model_Year_70 = temp_Year(:,1);
Model_Year_76 = temp_Year(:,2);
Model_Year_82 = temp_Year(:,3);
tbl = table(Model_Year_70,Model_Year_76,Model_Year_82,MPG);
mdl = fitlm(tbl,'MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82 - 1')
mdl = 
Linear regression model:
    MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82

Estimated Coefficients:
                     Estimate      SE        tStat       pValue  
                     ________    _______    ______    __________
    Model_Year_70      17.69      1.0328    17.127    3.2371e-30
    Model_Year_76     21.574     0.95387    22.617    4.0156e-39
    Model_Year_82      31.71     0.99896    31.743    5.2234e-51

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
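The correspondence between the two parameterizations can be sketched with ordinary least squares on a small hypothetical data set. The values of years and y below are made up for illustration and are not the carsmall data; b and g are illustrative names for the two coefficient vectors.

```matlab
% Hypothetical response and model-year data (not the carsmall values).
years = [70; 70; 70; 76; 76; 82; 82; 82];
y     = [17; 18; 19; 21; 22; 31; 32; 33];

D = double(years == [70 76 82]);   % indicator columns, R2016b or later

% Reference-level coding: intercept plus indicators for 76 and 82.
b = [ones(numel(y),1) D(:,2:3)] \ y;

% Full-indicator coding: three indicators and no intercept.
g = D \ y;

% The full-indicator coefficients are the group means, which equal
% beta0, beta0 + beta1, and beta0 + beta2 from the reference-level fit.
disp([g, [b(1); b(1)+b(2); b(1)+b(3)]])
```

The same relation appears in the carsmall output: 17.69 + 3.8839 is approximately 21.574, and 17.69 + 14.02 = 31.71.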
Choose Reference Level in Model
You can choose a reference level by modifying the order of categories in a categorical variable. First, create a categorical variable Year.
Year = categorical(Model_Year);
Check the order of categories by using the categories function.
categories(Year)
ans = 3x1 cell array
{'70'}
{'76'}
{'82'}
If you use Year as a predictor variable, then fitlm chooses the first category '70' as a reference level. Reorder Year by using the reordercats function.
Year_reordered = reordercats(Year,{'76','70','82'});
categories(Year_reordered)
ans = 3x1 cell array
{'76'}
{'70'}
{'82'}
The first category of Year_reordered is '76'. Create a linear regression model of MPG as a function of Year_reordered.
mdl2 = fitlm(Year_reordered,MPG,'VarNames',{'Model_Year','MPG'})
mdl2 = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE        tStat       pValue  
                     ________    _______    _______    __________
    (Intercept)       21.574     0.95387     22.617    4.0156e-39
    Model_Year_70    -3.8839      1.4059    -2.7625     0.0069402
    Model_Year_82     10.136      1.3812     7.3385    8.7634e-11

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15
mdl2 uses '76' as a reference level and includes two indicator variables ${{\rm I}}_{\mathrm{Year}=70}$ and ${{\rm I}}_{\mathrm{Year}=82}$.
Evaluate Categorical Predictor
The model display of mdl2 includes a p-value of each term to test whether or not the corresponding coefficient is equal to zero. Each p-value examines each indicator variable. To examine the categorical variable Model_Year as a group of indicator variables, use anova. Use the 'components' (default) option to return a component ANOVA table that includes ANOVA statistics for each variable in the model except the constant term.
anova(mdl2,'components')
ans=2×5 table
                  SumSq     DF    MeanSq      F        pValue  
                  ______    __    ______    _____    __________
    Model_Year    3190.1     2    1595.1    51.56    1.0694e-15
    Error         2815.2    91    30.936                       
The component ANOVA table includes the p-value of the Model_Year variable, which is smaller than the p-values of the indicator variables.
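To see where the group test comes from, the following sketch recomputes the F-statistic and its p-value from the rounded SumSq and DF values in the table. It uses the base betainc function for the upper tail of the F-distribution, which is an assumption made here to keep the sketch independent of toolbox functions. Because Model_Year is the only predictor, the result should agree with the "F-statistic vs. constant model" line of the mdl2 display.

```matlab
% Values copied from the component ANOVA table display (rounded).
ssTerm = 3190.1;  dfTerm = 2;    % Model_Year row
ssErr  = 2815.2;  dfErr  = 91;   % Error row

% F = MeanSq/MSE, with two numerator degrees of freedom because
% Model_Year contributes two indicator variables.
F = (ssTerm/dfTerm)/(ssErr/dfErr);

% Upper-tail probability of the F(dfTerm,dfErr) distribution.
p = betainc(dfErr/(dfErr + dfTerm*F), dfErr/2, dfTerm/2);

fprintf('F = %.2f, p = %.3g\n', F, p);
```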
mdl: Linear regression model object
LinearModel object | CompactLinearModel object
Linear regression model object, specified as a LinearModel object created by using fitlm or stepwiselm, or a CompactLinearModel object created by using compact.
anovatype: ANOVA type
'component' (default) | 'summary'
ANOVA type, specified as one of these values:
'component': anova returns the table tbl with ANOVA statistics for each variable in the model except the constant term.
'summary': anova returns the table tbl with summary ANOVA statistics for grouped variables and the model as a whole.
For details, see the tbl output argument description.
sstype: Sum of squares type
'h' (default) | 1 | 2 | 3
Sum of squares type for each term, specified as one of the values in this table.
Value | Description
1 | Type 1 sum of squares: reduction in residual sum of squares obtained by adding the term to a fit that already includes the preceding terms
2 | Type 2 sum of squares: reduction in residual sum of squares obtained by adding the term to a model that contains all other terms
3 | Type 3 sum of squares: reduction in residual sum of squares obtained by adding the term to a model that contains all other terms, but with their effects constrained to obey the usual “sigma restrictions” that make models estimable
'h' | Hierarchical model: similar to Type 2, but uses both continuous and categorical factors to determine the hierarchy of terms
The sum of squares for any term is determined by comparing two models. For a model containing main effects but no interactions, the value of sstype influences the computations on unbalanced data only.
Suppose you are fitting a model with two factors and their interaction, and the terms appear in the order A, B, AB. Let R(·) represent the residual sum of squares for the model. So, R(A, B, AB) is the residual sum of squares fitting the whole model, R(A) is the residual sum of squares fitting the main effect of A only, and R(1) is the residual sum of squares fitting the mean only. The three sum of squares types are as follows:
Term | Type 1 Sum of Squares | Type 2 Sum of Squares | Type 3 Sum of Squares
A | R(1) – R(A) | R(B) – R(A, B) | R(B, AB) – R(A, B, AB)
B | R(A) – R(A, B) | R(A) – R(A, B) | R(A, AB) – R(A, B, AB)
AB | R(A, B) – R(A, B, AB) | R(A, B) – R(A, B, AB) | R(A, B) – R(A, B, AB)
The models for Type 3 sum of squares have sigma restrictions imposed. This means, for example, that in fitting R(B, AB), the array of AB effects is constrained to sum to 0 over A for each value of B, and over B for each value of A.
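To make the table concrete for the simpler case of two main effects without an interaction, the following sketch computes Type 1 and Type 2 sums of squares from residual sums of squares obtained by ordinary least squares. The data, the helper rss, and the variable names are hypothetical and are not part of anova. The design is deliberately unbalanced so that the two types differ for A.

```matlab
% Hypothetical unbalanced two-factor data (illustrative values only).
A = [0 0 0 1 1 1 1 1]';   % indicator for the second level of factor A
B = [0 1 1 0 0 0 1 1]';   % indicator for the second level of factor B
y = [3 5 6 4 5 4 8 9]';

% Residual sum of squares of the least-squares fit with design matrix X.
rss = @(X) sum((y - X*(X\y)).^2);
one = ones(size(y));

R1  = rss(one);          % R(1): mean only
RA  = rss([one A]);      % R(A)
RB  = rss([one B]);      % R(B)
RAB = rss([one A B]);    % R(A, B)

% Type 1: each term is added after the terms that precede it (A, then B).
type1_A = R1 - RA;
type1_B = RA - RAB;

% Type 2: each term is added to a model containing all other terms.
type2_A = RB - RAB;
type2_B = RA - RAB;

fprintf('A: Type 1 = %.3f, Type 2 = %.3f\n', type1_A, type2_A);
fprintf('B: Type 1 = %.3f, Type 2 = %.3f\n', type1_B, type2_B);
```

Because B is the last term, its Type 1 and Type 2 sums of squares coincide, while those for A differ. On a balanced design the two values for A would also coincide, which is consistent with the statement that sstype matters only for unbalanced data when the model has no interactions.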
For Type 3 sum of squares:
If mdl is a CompactLinearModel object and the regression model is nonhierarchical, anova returns an error.
If mdl is a LinearModel object and the regression model is nonhierarchical, anova refits the model using effects coding whenever it needs to compute a Type 3 sum of squares.
If the regression model in mdl is hierarchical, anova computes the results without refitting the model.
sstype applies only if anovatype is 'component'.
tbl: ANOVA summary statistics table
ANOVA summary statistics table, returned as a table.
The contents of tbl depend on the ANOVA type specified in anovatype.
If anovatype is 'component', then tbl contains ANOVA statistics for each variable in the model except the constant (intercept) term. The table includes these columns for each variable:
Column | Description
SumSq | Sum of squares explained by the term, computed depending on sstype
DF | Degrees of freedom
MeanSq | Mean square, defined by MeanSq = SumSq/DF
F | F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by F = MeanSq/MSE, where MSE is the mean squared error. When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the DF value for the corresponding term, and the denominator degrees of freedom is n – p, where n is the number of observations and p is the number of coefficients in the model.
pValue | p-value of the F-statistic value
For an example, see Component ANOVA Table.
If anovatype is 'summary', then tbl contains summary statistics of grouped terms for each row. The table includes the same columns as 'component' and these rows:
Row | Description
Total | Total statistics
Model | Statistics for the model as a whole. If the model includes nonlinear terms, anova partitions this row into Linear and Nonlinear.
Linear | Statistics for linear terms
Nonlinear | Statistics for nonlinear terms
Residual | Statistics for residuals. If the data set includes replications, anova partitions this row into Lack of fit and Pure error.
Lack of fit | Lack-of-fit statistics
Pure error | Statistics for pure error
For an example, see Summary ANOVA Table.
More complete ANOVA statistics are available in the anova1, anova2, and anovan functions.
CompactLinearModel | LinearModel | coefCI | coefTest | dwtest