MATLAB Examples

Assess Fit of Model Using F-statistic

This example shows how to assess the fit of the model and the significance of the regression coefficients using the F-statistic.

Load the sample data.

load hospital
tbl = table(hospital.Age,hospital.Weight,hospital.Smoker,hospital.BloodPressure(:,1), ...
      'VariableNames',{'Age','Weight','Smoker','BloodPressure'});
tbl.Smoker = categorical(tbl.Smoker);

Fit a linear regression model.

mdl = fitlm(tbl,'BloodPressure ~ Age*Weight + Smoker + Weight^2')
mdl = 


Linear regression model:
    BloodPressure ~ 1 + Smoker + Age*Weight + Weight^2

Estimated Coefficients:
                    Estimate        SE         tStat        pValue  
                   __________    _________    ________    __________

    (Intercept)        168.02       27.694       6.067    2.7149e-08
    Age              0.079569      0.39861     0.19962       0.84221
    Weight           -0.69041       0.3435     -2.0099      0.047305
    Smoker_true        9.8027       1.0256      9.5584    1.5969e-15
    Age:Weight     0.00021796    0.0025258    0.086294       0.93142
    Weight^2        0.0021877    0.0011037      1.9822      0.050375


Number of observations: 100, Error degrees of freedom: 94
Root Mean Squared Error: 4.73
R-squared: 0.528,  Adjusted R-Squared 0.503
F-statistic vs. constant model: 21, p-value = 4.81e-14

The F-statistic of the linear fit versus the constant model is 21, with a p-value of 4.81e-14. The model is significant at the 5% significance level. The R-squared value of 0.528 means the model explains about 53% of the variability in the response. There might be other predictor (explanatory) variables that are not included in the current model.
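The overall F-statistic can be reproduced from the sums of squares stored in the fitted model. The following is a minimal sketch, not part of the original example: it repeats the fit so that it runs on its own, and the variable names dfModel, F, p, and R2 are chosen here for illustration.

```matlab
% Repeat the fit from this example so the block runs on its own
load hospital
tbl = table(hospital.Age,hospital.Weight,hospital.Smoker,hospital.BloodPressure(:,1), ...
      'VariableNames',{'Age','Weight','Smoker','BloodPressure'});
tbl.Smoker = categorical(tbl.Smoker);
mdl = fitlm(tbl,'BloodPressure ~ Age*Weight + Smoker + Weight^2');

% Model degrees of freedom: number of coefficients minus the intercept
dfModel = mdl.NumCoefficients - 1;

% F = (regression mean square) / (error mean square)
F = (mdl.SSR/dfModel) / (mdl.SSE/mdl.DFE);

% Upper-tail probability of the F distribution gives the p-value
p = fcdf(F,dfModel,mdl.DFE,'upper');

% R-squared is the share of the total sum of squares explained by the model
R2 = mdl.SSR/mdl.SST;
```

The values of F, p, and R2 should agree with the F-statistic, p-value, and R-squared lines printed in the model display (about 21, 4.81e-14, and 0.528).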

Display the ANOVA table for the fitted model.

anova(mdl,'summary')
ans =

  5x5 table

                   SumSq     DF    MeanSq      F         pValue  
                   ______    __    ______    ______    __________

    Total          4461.2    99    45.062                        
    Model          2354.5     5     470.9    21.012    4.8099e-14
    . Linear       2263.3     3    754.42    33.663    7.2417e-15
    . Nonlinear    91.248     2    45.624    2.0358        0.1363
    Residual       2106.6    94    22.411                        

This display separates the variability in the model into linear and nonlinear terms. Because there are two nonlinear terms (Weight^2 and the interaction between Weight and Age), the nonlinear degrees of freedom in the DF column is 2. There are three linear terms in the model (one Smoker indicator variable, Weight, and Age), so the linear degrees of freedom is 3. The corresponding F-statistics in the F column test the significance of the linear terms and the nonlinear terms as separate groups.
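Each group F-statistic is the group mean square divided by the residual mean square. As a rough check, the sketch below recomputes the Nonlinear row from the numbers printed in the summary table; the variable names are illustrative, and the values are copied from the display, so the results match only to the displayed precision.

```matlab
% Values copied from the summary ANOVA table above
ssNonlinear = 91.248;   % SumSq, Nonlinear row
dfNonlinear = 2;        % DF, Nonlinear row (Weight^2 and Age:Weight)
ssResidual  = 2106.6;   % SumSq, Residual row
dfResidual  = 94;       % DF, Residual row

% Mean squares are sums of squares divided by their degrees of freedom
msNonlinear = ssNonlinear/dfNonlinear;
msResidual  = ssResidual/dfResidual;

% Group F-statistic and its upper-tail p-value
Fnonlinear = msNonlinear/msResidual;
pNonlinear = fcdf(Fnonlinear,dfNonlinear,dfResidual,'upper');
```

The results should be close to the F and pValue entries in the Nonlinear row (2.0358 and 0.1363). With a p-value above 0.05, the two nonlinear terms are not jointly significant at the 5% level.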

When there are replicated observations, the residual term is also separated into two parts: the error due to lack of fit, and the pure error, which is independent of the model and is obtained from the replicated observations. In that case, an additional F-statistic tests the lack of fit, that is, whether the fit is adequate. In this example, there are no replicated observations.

Display the ANOVA table for the model terms.

anova(mdl)
ans =

  6x5 table

                   SumSq      DF     MeanSq         F          pValue  
                  ________    __    ________    _________    __________

    Age             62.991     1      62.991       2.8107      0.096959
    Weight        0.064104     1    0.064104    0.0028604       0.95746
    Smoker          2047.5     1      2047.5       91.363    1.5969e-15
    Age:Weight     0.16689     1     0.16689    0.0074466       0.93142
    Weight^2        88.057     1      88.057       3.9292      0.050375
    Error           2106.6    94      22.411                           

This display decomposes the ANOVA table into the model terms. The corresponding F-statistics in the F column assess the statistical significance of each term. The F-test for Smoker tests whether the coefficient of the indicator variable for smoker is different from zero, that is, whether being a smoker has a significant effect on BloodPressure. The degrees of freedom for each model term is the numerator degrees of freedom for the corresponding F-test. All of the terms have 1 degree of freedom. For a categorical variable, the degrees of freedom is the number of indicator variables. Smoker has only one indicator variable, so its degrees of freedom is also 1.
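For a term with 1 degree of freedom, an F-test of its coefficient is equivalent to the two-sided t-test, with F equal to the square of the t-statistic. The sketch below checks this using the tStat values printed in the coefficient table for Smoker, Age:Weight, and Weight^2; the variable names are illustrative, and the values are copied from the display. Note that the Age and Weight rows of the ANOVA table do not match their coefficient p-values. This is presumably because those terms are contained in higher-order terms and the default sum-of-squares type in anova handles such terms differently; that explanation is an assumption, not something stated in this example.

```matlab
% t-statistics copied from the Estimated Coefficients table
tSmoker      = 9.5584;     % Smoker_true
tInteraction = 0.086294;   % Age:Weight
tWeightSq    = 1.9822;     % Weight^2

% Squaring a t-statistic gives the 1-degree-of-freedom F-statistic
Fterms = [tSmoker tInteraction tWeightSq].^2;

% p-values from the F distribution with 1 and 94 degrees of freedom
pTerms = fcdf(Fterms,1,94,'upper');
```

The entries of Fterms and pTerms should be close to the F and pValue columns for Smoker, Age:Weight, and Weight^2 in the table above, up to rounding in the displayed t-statistics.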