anova

Analysis of variance for linear regression model


Syntax

tbl = anova(mdl)

tbl = anova(mdl,anovatype)

tbl = anova(mdl,'component',sstype)

Description


tbl = anova(mdl) returns a table with component ANOVA statistics.


tbl = anova(mdl,anovatype) returns ANOVA statistics of the specified type anovatype. For example, specify anovatype as 'component' (default) to return a table with component ANOVA statistics, or specify anovatype as 'summary' to return a table with summary ANOVA statistics.

tbl = anova(mdl,'component',sstype) computes component ANOVA statistics using the specified type of sum of squares.
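
The three call forms can be compared side by side. The following is a minimal sketch, assuming the carsmall sample data set that ships with Statistics and Machine Learning Toolbox is available; the choice of predictors (Weight and Horsepower) and the names data, tblC, tblS, and tbl1 are illustrative only.

```matlab
% Fit a small model, then request each kind of ANOVA table.
load carsmall
data = table(Weight,Horsepower,MPG);
mdl = fitlm(data,'MPG ~ Weight + Horsepower');

tblC = anova(mdl);                  % component table (default anovatype)
tblS = anova(mdl,'summary');        % summary table
tbl1 = anova(mdl,'component',1);    % component table, Type 1 sums of squares
```

All three outputs are tables with the same five columns, so they can be indexed the same way (for example, tblC.pValue).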

Examples


Component ANOVA Table


Create a component ANOVA table from a linear regression model of the hospital data set.

Load the hospital data set and create a model of blood pressure as a function of age and gender.

load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')

mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE        tStat       pValue  
                   _________    ________    ________    _________

    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772


Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125

Create an ANOVA table of the model.

tbl = anova(mdl)

tbl=4×5 table
             SumSq     DF    MeanSq       F        pValue 
             ______    __    ______    _______    ________

    Age      18.705     1    18.705    0.40055     0.52831
    Sex      222.09     1    222.09     4.7558    0.031643
    Age^2    30.934     1    30.934    0.66242     0.41772
    Error    4483.1    96    46.699

The table displays the following columns for each term except the constant (intercept) term:

SumSq — Sum of squares explained by the term.
DF — Degrees of freedom. In this example, DF is 1 for each term in the model and n – p for the error term, where n is the number of observations and p is the number of coefficients (including the intercept) in the model. For example, the DF for the error term in this model is 100 – 4 = 96. If any variable in the model is a categorical variable, the DF for that variable is the number of indicator variables created for its categories (number of categories – 1).
MeanSq — Mean square, defined by MeanSq = SumSq/DF. For example, the mean square of the error term, mean squared error (MSE), is 4.4831e+03/96 = 46.6991.
F — F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by F = MeanSq/MSE, where MSE is the mean squared error. When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the DF value for the corresponding term, and the denominator degrees of freedom is n – p. In this example, each F-statistic follows an $F_{(1, 96)}$ -distribution.
pValue — p-value of the F-statistic value. For example, the p-value for Age is 0.5283, implying that Age is not significant at the 5% significance level given the other terms in the model.
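
The column definitions above can be checked numerically. The following is a minimal sketch, assuming the same hospital model as in this example; the names data, meanSq, mse, dfe, F, and p are illustrative, and fcdf with the 'upper' option is used here to obtain the upper-tail probability of the F-distribution.

```matlab
% Rebuild MeanSq, F, and pValue from the SumSq and DF columns.
load hospital
data = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
data.Sex = categorical(data.Sex);
mdl = fitlm(data,'BloodPressure ~ Sex + Age^2');
tbl = anova(mdl);

meanSq = tbl.SumSq ./ tbl.DF;                 % MeanSq = SumSq/DF
mse    = meanSq(end);                         % last row is the Error term
dfe    = tbl.DF(end);                         % n - p
F      = meanSq(1:end-1) ./ mse;              % F = MeanSq/MSE
p      = fcdf(F,tbl.DF(1:end-1),dfe,'upper'); % upper-tail F probability

% Differences from the values that anova reports (expected to be near zero)
max(abs(F - tbl.F(1:end-1)))
max(abs(p - tbl.pValue(1:end-1)))
```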

Summary ANOVA Table


Create a summary ANOVA table from a linear regression model of the hospital data set.

Load the hospital data set and create a model of blood pressure as a function of age and gender.

load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')

mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE        tStat       pValue  
                   _________    ________    ________    _________

    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772


Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125

Create a summary ANOVA table of the model.

tbl = anova(mdl,'summary')

tbl=7×5 table
                     SumSq     DF    MeanSq       F        pValue 
                     ______    __    ______    _______    ________

    Total            4757.8    99    48.059                       
    Model            274.73     3    91.577      1.961     0.12501
    . Linear          243.8     2     121.9     2.6103    0.078726
    . Nonlinear      30.934     1    30.934    0.66242     0.41772
    Residual         4483.1    96    46.699                       
    . Lack of fit    1483.1    39    38.028    0.72253     0.85732
    . Pure error       3000    57    52.632

The table displays tests for groups of terms: Total, Model, and Residual.

Total — This row shows the total sum of squares (SumSq), degrees of freedom (DF), and mean square (MeanSq) of the response around its mean. Note that MeanSq = SumSq/DF.
Model — This row includes SumSq, DF, MeanSq, F-statistic value (F), and p-value (pValue). Because this model includes a nonlinear term (Age^2), anova partitions the sum of squares (SumSq) of Model into two parts: SumSq explained by the linear terms (Age and Sex) and SumSq explained by the nonlinear term (Age^2). The corresponding F-statistic values are for testing the significance of the linear terms and the nonlinear term as separate groups. The nonlinear group consists of the Age^2 term only, so it has the same p-value as the Age^2 term in the Component ANOVA Table.
Residual — This row includes SumSq, DF, MeanSq, F, and pValue. Because the data set includes replications, anova partitions the residual SumSq into the part for the replications (Pure error) and the rest (Lack of fit). To test the lack of fit, anova computes the F-statistic value by comparing the model residuals to the model-free variance estimate computed on the replications. The F-statistic value shows no evidence of lack of fit.
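
The partitions described above can be verified directly from the table. The following is a minimal sketch, assuming the same hospital model and the row order shown in the summary table of this example; rows are addressed by position rather than by name, and the names data, ss, df, Flof, and plof are illustrative.

```matlab
% Check the sum-of-squares partitions in the summary ANOVA table.
load hospital
data = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
data.Sex = categorical(data.Sex);
mdl = fitlm(data,'BloodPressure ~ Sex + Age^2');
tbl = anova(mdl,'summary');

% Row order assumed from the displayed table:
% 1 Total, 2 Model, 3 Linear, 4 Nonlinear, 5 Residual, 6 Lack of fit, 7 Pure error
ss = tbl.SumSq;
df = tbl.DF;

ss(1) - (ss(2) + ss(5))    % Total    = Model + Residual
ss(2) - (ss(3) + ss(4))    % Model    = Linear + Nonlinear
ss(5) - (ss(6) + ss(7))    % Residual = Lack of fit + Pure error

% Lack-of-fit F-statistic: lack-of-fit mean square over pure-error mean square
Flof = (ss(6)/df(6)) / (ss(7)/df(7));
plof = fcdf(Flof,df(6),df(7),'upper');
```

Each of the three differences is expected to be zero up to floating-point rounding, and Flof and plof are expected to match the F and pValue entries of the Lack of fit row.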

Linear Regression with Categorical Predictor


Fit a linear regression model that contains a categorical predictor. Reorder the categories of the categorical predictor to control the reference level in the model. Then, use anova to test the significance of the categorical variable.

Model with Categorical Predictor

Load the carsmall data set and create a linear regression model of MPG as a function of Model_Year. To treat the numeric vector Model_Year as a categorical variable, identify the predictor using the 'CategoricalVars' name-value pair argument.

load carsmall
mdl = fitlm(Model_Year,MPG,'CategoricalVars',1,'VarNames',{'Model_Year','MPG'})

mdl = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE      tStat       pValue  
                     ________    ______    ______    __________

    (Intercept)        17.69     1.0328    17.127    3.2371e-30
    Model_Year_76     3.8839     1.4059    2.7625     0.0069402
    Model_Year_82      14.02     1.4369    9.7571    8.2164e-16


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15

The model formula in the display, MPG ~ 1 + Model_Year, corresponds to

$MPG = β_{0} + β_{1} Ι_{Year = 76} + β_{2} Ι_{Year = 82} + ϵ$ ,

where $Ι_{Year = 76}$ and $Ι_{Year = 82}$ are indicator variables whose value is one if the value of Model_Year is 76 and 82, respectively. The Model_Year variable includes three distinct values, which you can check by using the unique function.

unique(Model_Year)

fitlm chooses the smallest value in Model_Year as a reference level ('70') and creates two indicator variables $Ι_{Year = 76}$ and $Ι_{Year = 82}$ . The model includes only two indicator variables because the design matrix becomes rank deficient if the model includes three indicator variables (one for each level) and an intercept term.

Model with Full Indicator Variables

You can interpret the model formula of mdl as a model that has three indicator variables without an intercept term:

$MPG = β_{0} Ι_{Year = 70} + (β_{0} + β_{1}) Ι_{Year = 76} + (β_{0} + β_{2}) Ι_{Year = 82} + ϵ$ .

Alternatively, you can create a model that has three indicator variables without an intercept term by manually creating indicator variables and specifying the model formula.

temp_Year = dummyvar(categorical(Model_Year));
Model_Year_70 = temp_Year(:,1);
Model_Year_76 = temp_Year(:,2);
Model_Year_82 = temp_Year(:,3);
tbl = table(Model_Year_70,Model_Year_76,Model_Year_82,MPG);
mdl = fitlm(tbl,'MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82 - 1')

mdl = 
Linear regression model:
    MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82

Estimated Coefficients:
                     Estimate      SE       tStat       pValue  
                     ________    _______    ______    __________

    Model_Year_70      17.69      1.0328    17.127    3.2371e-30
    Model_Year_76     21.574     0.95387    22.617    4.0156e-39
    Model_Year_82      31.71     0.99896    31.743    5.2234e-51


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
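
The two parameterizations describe the same fit. For a model with a single categorical predictor, the least-squares coefficient of each indicator variable in the no-intercept form equals the sample mean of the response in that category. The following is a minimal sketch of that relationship, assuming the carsmall data; the names mdlRef, b, bFull, years, and groupMeans are illustrative.

```matlab
% Relate the reference-level coefficients to the full-indicator coefficients.
load carsmall
mdlRef = fitlm(Model_Year,MPG,'CategoricalVars',1, ...
    'VarNames',{'Model_Year','MPG'});
b = mdlRef.Coefficients.Estimate;          % [beta0; beta1; beta2]

% Coefficients of the equivalent model without an intercept term
bFull = [b(1); b(1) + b(2); b(1) + b(3)];

% Sample mean of MPG within each model year, ignoring missing MPG values
years = unique(Model_Year);
groupMeans = arrayfun(@(y) mean(MPG(Model_Year == y),'omitnan'),years);

[bFull groupMeans]    % the two columns should agree up to rounding
```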

Choose Reference Level in Model

You can choose a reference level by modifying the order of categories in a categorical variable. First, create a categorical variable Year.

Year = categorical(Model_Year);

Check the order of categories by using the categories function.

categories(Year)

ans = 3x1 cell
    {'70'}
    {'76'}
    {'82'}

If you use Year as a predictor variable, then fitlm chooses the first category '70' as a reference level. Reorder Year by using the reordercats function.

Year_reordered = reordercats(Year,{'76','70','82'});
categories(Year_reordered)

ans = 3x1 cell
    {'76'}
    {'70'}
    {'82'}

The first category of Year_reordered is '76'. Create a linear regression model of MPG as a function of Year_reordered.

mdl2 = fitlm(Year_reordered,MPG,'VarNames',{'Model_Year','MPG'})

mdl2 = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE        tStat       pValue  
                     ________    _______    _______    __________

    (Intercept)       21.574     0.95387     22.617    4.0156e-39
    Model_Year_70    -3.8839      1.4059    -2.7625     0.0069402
    Model_Year_82     10.136      1.3812     7.3385    8.7634e-11


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15

mdl2 uses '76' as a reference level and includes two indicator variables $Ι_{Year = 70}$ and $Ι_{Year = 82}$ .

Evaluate Categorical Predictor

The model display of mdl2 includes a p-value for each coefficient to test whether the coefficient is equal to zero. Each of these p-values examines a single indicator variable. To examine the categorical variable Model_Year as a group of indicator variables, use anova. Use the 'component' (default) option to return a component ANOVA table that includes ANOVA statistics for each variable in the model except the constant term.

anova(mdl2,'component')

ans=2×5 table
                  SumSq     DF    MeanSq      F        pValue  
                  ______    __    ______    _____    __________

    Model_Year    3190.1     2    1595.1    51.56    1.0694e-15
    Error         2815.2    91    30.936

The component ANOVA table includes the p-value of the Model_Year variable, which is smaller than the p-values of the indicator variables.
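
Because anova tests Model_Year as a group, the result should not depend on which category serves as the reference level. The following is a minimal sketch that compares the two reference levels and, as a cross-check, calls coefTest, which tests whether all coefficients except the intercept are zero; the names mdl1, a1, a2, and pAll are illustrative.

```matlab
% Compare the group test for Model_Year under two reference levels.
load carsmall
Year = categorical(Model_Year);
Year_reordered = reordercats(Year,{'76','70','82'});
mdl1 = fitlm(Year,MPG,'VarNames',{'Model_Year','MPG'});           % reference '70'
mdl2 = fitlm(Year_reordered,MPG,'VarNames',{'Model_Year','MPG'}); % reference '76'

a1 = anova(mdl1);
a2 = anova(mdl2);
abs(a1.F(1) - a2.F(1))        % F for Model_Year under each reference level

% With a single predictor, the group test coincides with the overall F-test
pAll = coefTest(mdl2);
abs(pAll - a2.pValue(1))
```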

Input Arguments


`mdl` — Linear regression model object
`LinearModel` object | `CompactLinearModel` object

Linear regression model object, specified as a LinearModel object created by using fitlm or stepwiselm, or a CompactLinearModel object created by using compact.

`anovatype` — ANOVA type
`'component'` (default) | `'summary'`

ANOVA type, specified as one of these values:

'component' — anova returns the table tbl with ANOVA statistics for each variable in the model except the constant term.
'summary' — anova returns the table tbl with summary ANOVA statistics for grouped variables and the model as a whole.

For details, see the tbl output argument description.

`sstype` — Sum of squares type
`'h'` (default) | `1` | `2` | `3`

Sum of squares type for each term, specified as one of the values in this table.

Value	Description
`1`	Type 1 sum of squares — Reduction in residual sum of squares obtained by adding the term to a fit that already includes the preceding terms
`2`	Type 2 sum of squares — Reduction in residual sum of squares obtained by adding the term to a model that contains all other terms that do not contain the term in question
`3`	Type 3 sum of squares — Reduction in residual sum of squares obtained by adding the term to a model that contains all other terms, but with their effects constrained to obey the usual “sigma restrictions” that make models estimable
`'h'`	Hierarchical model — Similar to Type 2, but uses both continuous and categorical factors to determine the hierarchy of terms

The sum of squares for any term is determined by comparing two models. For a model containing main effects but no interactions, the value of sstype influences the computations on unbalanced data only.

Suppose you are fitting a model with two factors and their interaction, and the terms appear in the order A, B, AB. Let R(·) represent the residual sum of squares for the model. So, R(A, B, AB) is the residual sum of squares fitting the whole model, R(A) is the residual sum of squares fitting the main effect of A only, and R(1) is the residual sum of squares fitting the mean only. The three sum of squares types are as follows:

Term	Type 1 Sum of Squares	Type 2 Sum of Squares	Type 3 Sum of Squares
A	R(1) – R(A)	R(B) – R(A, B)	R(B, AB) – R(A, B, AB)
B	R(A) – R(A, B)	R(A) – R(A, B)	R(A, AB) – R(A, B, AB)
AB	R(A, B) – R(A, B, AB)	R(A, B) – R(A, B, AB)	R(A, B) – R(A, B, AB)

The models for Type 3 sum of squares have sigma restrictions imposed. This means, for example, that in fitting R(B, AB), the array of AB effects is constrained to sum to 0 over A for each value of B, and over B for each value of A.

For Type 3 sum of squares:

If mdl is a CompactLinearModel object and the regression model is nonhierarchical, anova returns an error.
If mdl is a LinearModel object and the regression model is nonhierarchical, anova refits the model using effects coding whenever it needs to compute a Type 3 sum of squares.
If the regression model in mdl is hierarchical, anova computes the results without refitting the model.

sstype applies only if anovatype is 'component'.
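
To see how sstype changes the results, the sums of squares can be placed side by side for a model with an interaction. The following is a minimal sketch, assuming the carsmall data with Weight as factor A and a categorical model year as factor B; the variable name Year and the names data, t1, t2, t3, ssByType, m0, and mA are illustrative. Rows with missing values are removed first so that every fit uses the same observations.

```matlab
% Compare Type 1, 2, and 3 sums of squares for a model with an interaction.
load carsmall
data = rmmissing(table(Weight,categorical(Model_Year),MPG, ...
    'VariableNames',{'Weight','Year','MPG'}));
mdl = fitlm(data,'MPG ~ Weight*Year');   % terms: Weight, Year, Weight:Year

t1 = anova(mdl,'component',1);
t2 = anova(mdl,'component',2);
t3 = anova(mdl,'component',3);

% SumSq columns side by side, using the row names of the component table
ssByType = table(t1.SumSq,t2.SumSq,t3.SumSq, ...
    'VariableNames',{'Type1','Type2','Type3'}, ...
    'RowNames',t1.Properties.RowNames)

% Type 1 sum of squares for the first term by hand: R(1) - R(A)
m0 = fitlm(data,'MPG ~ 1');        % mean only
mA = fitlm(data,'MPG ~ Weight');   % main effect of A only
m0.SSE - mA.SSE
```

The hand-computed difference is expected to match the Type 1 entry for Weight, and the interaction row is expected to agree between Type 1 and Type 2, as in the table of definitions above.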

Output Arguments


`tbl` — ANOVA summary statistics table
table

ANOVA summary statistics table, returned as a table.

The contents of tbl depend on the ANOVA type specified in anovatype.

If anovatype is 'component', then tbl contains ANOVA statistics for each variable in the model except the constant (intercept) term. The table includes these columns for each variable:

Column	Description
`SumSq`	Sum of squares explained by the term, computed according to `sstype`.
`DF`	Degrees of freedom. `DF` of a numeric variable is 1. `DF` of a categorical variable is the number of indicator variables created for its categories (number of categories – 1). Note that `tbl` contains one row for each categorical variable instead of one row for each indicator variable as in the model display; use `anova` to test a categorical variable as a group of indicator variables. `DF` of the error term is n – p, where n is the number of observations and p is the number of coefficients in the model.
`MeanSq`	Mean square, defined by `MeanSq` = `SumSq`/`DF`. `MeanSq` for the error term is the mean squared error (MSE).
`F`	F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by `F` = `MeanSq`/`MSE`. When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the `DF` value for the corresponding term, and the denominator degrees of freedom is n – p.
`pValue`	p-value of the F-statistic value.

For an example, see Component ANOVA Table.
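
Individual entries of a component table can be read by row and column name. The following is a minimal sketch, assuming the carsmall data and a model with one numeric and one categorical predictor; the variable name Year and the names data, pYear, and dfYear are illustrative. The DFE and MSE properties of the model object are used to cross-check the Error row.

```matlab
% Read entries of a component ANOVA table by row and column name.
load carsmall
data = table(Weight,categorical(Model_Year),MPG, ...
    'VariableNames',{'Weight','Year','MPG'});
mdl = fitlm(data,'MPG ~ Weight + Year');
tbl = anova(mdl);

tbl.Properties.RowNames           % one row per variable, plus Error
pYear  = tbl{'Year','pValue'};    % p-value for Year tested as a group
dfYear = tbl{'Year','DF'};        % number of categories - 1

% The Error row mirrors properties of the fitted model
tbl{'Error','DF'} - mdl.DFE
tbl{'Error','MeanSq'} - mdl.MSE
```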

If anovatype is 'summary', then tbl contains summary statistics of grouped terms for each row. The table includes the same columns as 'component' and these rows:

Row	Description
`Total`	Total statistics. `SumSq` — Total sum of squares, which is the sum of the squared deviations of the response around its mean. `DF` — Sum of the degrees of freedom of `Model` and `Residual`.
`Model`	Statistics for the model as a whole. `SumSq` — Model sum of squares, which is the sum of the squared deviations of the fitted values around the response mean. `F` and `pValue` — These values provide a test of whether the model as a whole fits significantly better than a degenerate model consisting of only a constant term. If `mdl` includes only linear terms, then `anova` does not decompose `Model` into `Linear` and `Nonlinear`.
`Linear`	Statistics for linear terms. `SumSq` — Sum of squares for linear terms, which is the difference between the model sum of squares and the sum of squares for nonlinear terms. `F` and `pValue` — These values provide a test of whether the model with only linear terms fits better than a degenerate model consisting of only a constant term. `anova` uses the mean squared error that is based on the full model to compute this F-value, so the F-value obtained by dropping the nonlinear terms and repeating the test is not the same as the value in this row.
`Nonlinear`	Statistics for nonlinear terms. `SumSq` — Sum of squares for nonlinear (higher-order or interaction) terms, which is the increase in the residual sum of squares obtained by keeping only the linear terms and dropping all nonlinear terms. `F` and `pValue` — These values provide a test of whether the full model fits significantly better than a smaller model consisting of only the linear terms.
`Residual`	Statistics for residuals. `SumSq` — Residual sum of squares, which is the sum of the squared residual values. `MeanSq` — Mean squared error, used to compute the F-statistic values for `Model`, `Linear`, and `Nonlinear`. If `mdl` is a full `LinearModel` object and the sample data contains replications (multiple observations sharing the same predictor values), then `anova` decomposes the residual sum of squares into a sum of squares for the replicated observations (`Pure error`) and the remaining sum of squares (`Lack of fit`).
`Lack of fit`	Lack-of-fit statistics. `SumSq` — Sum of squares due to lack of fit, which is the difference between the residual sum of squares and the replication sum of squares. `F` and `pValue` — The F-statistic value is the ratio of lack-of-fit `MeanSq` to pure error `MeanSq`. The ratio provides a test of bias by measuring whether the variation of the residuals is larger than the variation of the replications. A low p-value implies that adding terms to the model can improve the fit.
`Pure error`	Statistics for pure error. `SumSq` — Replication sum of squares, obtained by finding the sets of points with identical predictor values, computing the sum of squared deviations around the mean within each set, and pooling the computed values. `MeanSq` — Model-free pure error variance estimate of the response.

For an example, see Summary ANOVA Table.

Alternative Functionality

More complete ANOVA statistics are available in the anova1, anova2, and anovan functions.

Extended Capabilities

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

Usage notes and limitations:

This function supports model objects fitted with GPU array input arguments.

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
