This example shows how to fit a generalized linear model and analyze the results. A typical workflow involves these steps: import data, fit a generalized linear model, test its quality, modify the model to improve its quality, and make predictions based on the model. In this example, you use the Fisher iris data to compute the probability that a flower is in one of two classes.
Load the Fisher iris data.
Extract rows 51 to 150, which have the classification versicolor or virginica.
X = meas(51:end,:);
Create logical response variables that are
true for versicolor and
false for virginica.
y = strcmp('versicolor',species(51:end));
Fit a binomial generalized linear model to the data.
mdl = fitglm(X,y,'linear','Distribution','binomial')
mdl = Generalized linear regression model: logit(y) ~ 1 + x1 + x2 + x3 + x4 Distribution = Binomial Estimated Coefficients: Estimate SE tStat pValue ________ ______ _______ ________ (Intercept) 42.638 25.708 1.6586 0.097204 x1 2.4652 2.3943 1.0296 0.30319 x2 6.6809 4.4796 1.4914 0.13585 x3 -9.4294 4.7372 -1.9905 0.046537 x4 -18.286 9.7426 -1.8769 0.060529 100 observations, 95 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 127, p-value = 1.95e-26
According to the model display, some p-values in the
pValue column are not small, which implies that you can simplify the model.
Determine if 95% confidence intervals for the coefficients include 0. If so, you can remove the model terms with those intervals.
confint = coefCI(mdl)
confint = 5×2 -8.3984 93.6740 -2.2881 7.2185 -2.2122 15.5739 -18.8339 -0.0248 -37.6277 1.0554
Only the fourth predictor
x3 has a coefficient whose confidence interval does not include 0.
The coefficients of
x2 have large p-values and their 95% confidence intervals include 0. Test whether both coefficients can be zero. Specify a hypothesis matrix to select the coefficients of
M = [0 1 0 0 0 0 0 1 0 0]; p = coefTest(mdl,M)
p = 0.1442
The p-value is approximately 0.14, which is not small. Remove
x2 from the model.
mdl1 = removeTerms(mdl,'x1 + x2')
mdl1 = Generalized linear regression model: logit(y) ~ 1 + x3 + x4 Distribution = Binomial Estimated Coefficients: Estimate SE tStat pValue ________ ______ _______ __________ (Intercept) 45.272 13.612 3.326 0.00088103 x3 -5.7545 2.3059 -2.4956 0.012576 x4 -10.447 3.7557 -2.7816 0.0054092 100 observations, 97 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 118, p-value = 2.3e-26
Alternatively, you can identify important predictors using
mdl2 = stepwiseglm(X,y,'constant','Distribution','binomial','Upper','linear')
1. Adding x4, Deviance = 33.4208, Chi2Stat = 105.2086, PValue = 1.099298e-24 2. Adding x3, Deviance = 20.5635, Chi2Stat = 12.8573, PValue = 0.000336166 3. Adding x2, Deviance = 13.2658, Chi2Stat = 7.29767, PValue = 0.00690441
mdl2 = Generalized linear regression model: logit(y) ~ 1 + x2 + x3 + x4 Distribution = Binomial Estimated Coefficients: Estimate SE tStat pValue ________ ______ _______ ________ (Intercept) 50.527 23.995 2.1057 0.035227 x2 8.3761 4.7612 1.7592 0.078536 x3 -7.8745 3.8407 -2.0503 0.040334 x4 -21.43 10.707 -2.0014 0.04535 100 observations, 96 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 125, p-value = 5.4e-27
The p-value (
x2 in the coefficient table is greater than 0.05, but
x2 in the model because the p-value (
PValue) for adding
x2 is smaller than 0.05. The
stepwiseglm function computes
PValue using the fits with and without
x2, whereas the function computes
pValue based on an approximate standard error computed only from the final model. Therefore,
PValue is more reliable than
Examine a leverage plot to look for influential outliers.
An observation can be considered an outlier if its leverage substantially exceeds
p is the number of coefficients and
n is the number of observations. The dotted reference line is a recommended threshold, computed by
2*p/n, which corresponds to 0.08 in this plot. Some observations have leverage values larger than
10*p/n (that is, 0.40). Identify these observation points.
idxOutliers = find(mdl2.Diagnostics.Leverage > 10*mdl2.NumCoefficients/mdl2.NumObservations)
idxOutliers = 4×1 19 21 57 85
See if the model coefficients change when you fit a model excluding these points.
oldCoeffs = mdl2.Coefficients.Estimate; mdl3 = fitglm(X,y,'linear','Distribution','binomial', ... 'PredictorVars',2:4,'Exclude',idxOutliers); newCoeffs = mdl3.Coefficients.Estimate; disp([oldCoeffs newCoeffs])
50.5268 44.0085 8.3761 5.6361 -7.8745 -6.1145 -21.4296 -18.1236
The model coefficients in
mdl3 are different from those in mdl2. This result implies that the responses at the high-leverage points are not consistent with the predicted values from the reduced model.
mdl3 to predict the probability that a flower with average measurements is versicolor. Generate confidence intervals for the prediction.
[newf,newc] = predict(mdl3,mean(X))
newf = 0.4558
newc = 1×2 0.1234 0.8329
The model gives almost a 46% probability that the average flower is versicolor, with a wide confidence interval.