
GeneralizedLinearModel class

Generalized linear regression model class

Description

An object comprising training data, model description, diagnostic information, and fitted coefficients for a generalized linear regression. Predict model responses with the predict or feval methods.

Construction

mdl = fitglm(tbl) or mdl = fitglm(X,y) creates a generalized linear model of a table or dataset array tbl, or of the responses y to a data matrix X. For details, see fitglm.

mdl = stepwiseglm(tbl) or mdl = stepwiseglm(X,y) creates a generalized linear model of a table or dataset array tbl, or of the responses y to a data matrix X, with unimportant predictors excluded. For details, see stepwiseglm.

Input Arguments


tbl — Input data (table | dataset array)

Input data, specified as a table or dataset array. When modelspec is a formula, it specifies the variables to be used as the predictors and response. Otherwise, if you do not specify the predictor and response variables, the last variable is the response variable and the others are the predictor variables by default.

Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.

To set a different column as the response variable, use the ResponseVar name-value pair argument. To use a subset of the columns as predictors, use the PredictorVars name-value pair argument.

Data Types: single | double | logical

X — Predictor variables (matrix)

Predictor variables, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.

By default, there is a constant term in the model, unless you explicitly remove it, so do not include a column of 1s in X.

Data Types: single | double | logical

y — Response variable (vector)

Response variable, specified as an n-by-1 vector, where n is the number of observations. Each entry in y is the response for the corresponding row of X.

Data Types: single | double
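The matrix form of the call can be sketched as follows, assuming simulated Poisson data (the coefficients 0.5, 0.3, and -0.2 are made up for illustration). X holds only the two predictor columns, and the fitting function adds the constant term itself.

```matlab
rng default                               % for reproducibility
n = 50;
X = randn(n,2);                           % n-by-p predictor matrix, no column of 1s
y = poissrnd(exp(0.5 + X*[0.3; -0.2]));   % n-by-1 response (illustrative coefficients)

mdl = fitglm(X,y,'linear','Distribution','poisson');

names = mdl.CoefficientNames;             % first entry is the constant term
numCoef = mdl.NumCoefficients;            % intercept plus one slope per column of X
```

Adding a column of 1s to X here would duplicate the constant term and make the model terms rank deficient.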

Properties


CoefficientCovariance — Covariance matrix of coefficient estimates (numeric matrix)

Covariance matrix of coefficient estimates, stored as a p-by-p matrix of numeric values. p is the number of coefficients in the fitted model.

CoefficientNames — Coefficient names (cell array of strings)

Coefficient names, stored as a cell array of strings containing a label for each coefficient.

Coefficients — Coefficient values (table)

Coefficient values, stored as a table. Coefficients has one row for each coefficient and the following columns:

  • Estimate — Estimated coefficient value

  • SE — Standard error of the estimate

  • tStat — t statistic for a test that the coefficient is zero

  • pValue — p-value for the t statistic

To obtain any of these columns as a vector, index into the property using dot notation. For example, in mdl the estimated coefficient vector is

beta = mdl.Coefficients.Estimate

Use coefTest to perform other tests on the coefficients.
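As a rough illustration of working with the Coefficients table, the sketch below fits a model to simulated data (the data and the contrast matrix H are assumptions made for this example), pulls out individual columns, and runs a joint test with coefTest.

```matlab
rng default
X = randn(100,2);
y = poissrnd(exp(1 + X*[0.4; 0.1]));      % illustrative simulated counts
mdl = fitglm(X,y,'linear','Distribution','poisson');

beta = mdl.Coefficients.Estimate;         % coefficient estimates
se   = mdl.Coefficients.SE;               % standard errors
t    = mdl.Coefficients.tStat;            % reported t statistics

gap = max(abs(t - beta./se));             % tStat is Estimate divided by SE

% Joint test that both slopes are zero; each row of H selects one coefficient
H = [0 1 0; 0 0 1];
pJoint = coefTest(mdl,H);
```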

Deviance — Deviance of the fit (numeric value)

Deviance of the fit, stored as a numeric value. Deviance is useful for comparing two models when one is a special case of the other. The difference between the deviance of the two models has a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models. For more information on deviance, see Deviance.

DFE — Degrees of freedom for error (positive integer value)

Degrees of freedom for error (residuals), equal to the number of observations minus the number of estimated coefficients, stored as a positive integer value.

Diagnostics — Diagnostic information (table)

Diagnostic information for the model, stored as a table. Diagnostics can help identify outliers and influential observations. Diagnostics contains the following fields:

| Field | Meaning | Utility |
|---|---|---|
| Leverage | Diagonal elements of HatMatrix | Leverage indicates to what extent the predicted value for an observation is determined by the observed value for that observation. A value close to 1 indicates that the prediction is largely determined by that observation, with little contribution from the other observations. A value close to 0 indicates the fit is largely determined by the other observations. For a model with p coefficients and n observations, the average value of Leverage is p/n. An observation with Leverage larger than 2*p/n can be an outlier. |
| CooksDistance | Cook's measure of scaled change in fitted values | CooksDistance is a measure of scaled change in fitted values. An observation with CooksDistance larger than three times the mean Cook's distance can be an outlier. |
| HatMatrix | Projection matrix to compute fitted from observed responses | HatMatrix is an n-by-n matrix such that Fitted = HatMatrix*Y, where Y is the response vector and Fitted is the vector of fitted response values. |

All of these quantities are computed on the scale of the linear predictor. So, for example, in the equation that defines the hat matrix, Fitted = HatMatrix*Y, the fitted values Yfit and the responses Y for a model glm are

Yfit = glm.Fitted.LinearPredictor
Y = glm.Fitted.LinearPredictor + glm.Residuals.LinearPredictor
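The rules of thumb for Leverage and CooksDistance can be applied along these lines. This is only a sketch on simulated data (the data and the variable names are assumptions for the example), not a substitute for inspecting the diagnostics plots.

```matlab
rng default
X = randn(100,3);
y = poissrnd(exp(1 + X*[0.4; 0.2; 0.3]));    % illustrative simulated counts
mdl = fitglm(X,y,'linear','Distribution','poisson');

lev   = mdl.Diagnostics.Leverage;
cooks = mdl.Diagnostics.CooksDistance;
n = mdl.NumObservations;
p = mdl.NumEstimatedCoefficients;

highLeverage = find(lev > 2*p/n);            % leverage larger than 2*p/n
highCooks    = find(cooks > 3*mean(cooks));  % more than three times the mean Cook's distance
meanLeverage = mean(lev);                    % expected to be close to p/n
```

The index vectors highLeverage and highCooks identify rows worth a closer look; either can be empty.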

Dispersion — Scale factor of the variance of the response (structure)

Scale factor of the variance of the response, stored as a structure. Dispersion multiplies the variance function for the distribution.

For example, the variance function for the binomial distribution is p(1–p)/n, where p is the probability parameter and n is the sample size parameter. If Dispersion is near 1, the variance of the data appears to agree with the theoretical variance of the binomial distribution. If Dispersion is larger than 1, the data are "overdispersed" relative to the binomial distribution.

DispersionEstimated — Flag to indicate use of dispersion scale factor (logical value)

Flag to indicate whether fitglm used the Dispersion scale factor to compute standard errors for the coefficients in Coefficients.SE, stored as a logical value. If DispersionEstimated is false, fitglm used the theoretical value of the variance.

  • DispersionEstimated can be false only for 'binomial' or 'poisson' distributions.

  • Set DispersionEstimated by setting the DispersionFlag name-value pair in fitglm.
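A small sketch of the effect of DispersionFlag, assuming simulated Poisson data. The two fits differ only in whether the dispersion is estimated, so the coefficient estimates should agree while the standard errors can differ.

```matlab
rng default
X = randn(100,2);
y = poissrnd(exp(1 + X*[0.4; 0.2]));      % illustrative simulated counts

% Default for 'poisson': theoretical variance, DispersionEstimated is false
mdlTheory = fitglm(X,y,'linear','Distribution','poisson');

% Ask fitglm to estimate the dispersion and use it for the standard errors
mdlEst = fitglm(X,y,'linear','Distribution','poisson','DispersionFlag',true);

flags = [mdlTheory.DispersionEstimated, mdlEst.DispersionEstimated];
seTable = table(mdlTheory.Coefficients.SE, mdlEst.Coefficients.SE, ...
    'VariableNames',{'SE_theoretical','SE_estimated'});
```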

Distribution — Generalized distribution information (structure)

Generalized distribution information, stored as a structure with the following fields relating to the generalized distribution:

| Field | Description |
|---|---|
| Name | Name of the distribution, one of 'normal', 'binomial', 'poisson', 'gamma', or 'inverse gaussian'. |
| DevianceFunction | Function that computes the components of the deviance as a function of the fitted parameter values and the response values. |
| VarianceFunction | Function that computes the theoretical variance for the distribution as a function of the fitted parameter values. When DispersionEstimated is true, Dispersion multiplies the variance function in the computation of the coefficient standard errors. |

Fitted — Fitted response values based on input data (table)

Fitted (predicted) values based on the input data, stored as a table with one row for each observation and the following columns.

| Field | Description |
|---|---|
| Response | Predicted values on the scale of the response. |
| LinearPredictor | Predicted values on the scale of the linear predictor. These are the same as the link function applied to the Response fitted values. |
| Probability | Fitted probabilities (this column is included only with the binomial distribution). |

To obtain any of the columns as a vector, index into the property using dot notation. For example, in the model mdl, the vector f of fitted values on the response scale is

f = mdl.Fitted.Response

Use predict to compute predictions for other predictor values, or to compute confidence bounds on Fitted.
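The relationship between the two Fitted columns, and between Fitted and predict, can be checked with a sketch like the following. It assumes simulated data and the canonical log link for the Poisson distribution; Xnew is a made-up pair of new observations.

```matlab
rng default
X = randn(100,2);
y = poissrnd(exp(1 + X*[0.4; 0.2]));      % illustrative simulated counts
mdl = fitglm(X,y,'linear','Distribution','poisson');   % canonical log link

f   = mdl.Fitted.Response;                % fitted values on the response scale
eta = mdl.Fitted.LinearPredictor;         % fitted values on the linear predictor scale

gapLink = max(abs(log(f) - eta));         % log link applied to Response gives LinearPredictor

Xnew = [0 0; 1 -1];                       % two hypothetical new observations
[ynew,ynewCI] = predict(mdl,Xnew);        % predictions and confidence bounds
gapPred = max(abs(predict(mdl,X) - f));   % predict on the training data matches Fitted.Response
```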

Formula — Model information (LinearFormula object | NonLinearFormula object)

Model information, stored as a LinearFormula object or NonLinearFormula object. If you fit a linear or generalized linear regression model, then Formula is a LinearFormula object. If you fit a nonlinear regression model, then Formula is a NonLinearFormula object.

Link — Link function (structure)

Link function, stored as a structure with the following fields:

| Field | Description |
|---|---|
| Name | Name of the link function, or '' if you specified the link as a function handle rather than a string. |
| LinkFunction | The function that defines f, a function handle. |
| Derivative | Derivative of f, a function handle. |
| Inverse | Inverse of f, a function handle. |

The link is a function f that links the distribution parameter μ to the fitted linear combination Xb of the predictors:

f(μ) = Xb.

LogLikelihood — Log likelihood (numeric value)

Log likelihood of the model distribution at the response values, stored as a numeric value. The mean is fitted from the model, and other parameters are estimated as part of the model fit.

ModelCriterion — Criterion for model comparison (structure)

Criterion for model comparison, stored as a structure with the following fields:

  • AIC — Akaike information criterion

  • AICc — Akaike information criterion corrected for sample size

  • BIC — Bayesian information criterion

  • CAIC — Consistent Akaike information criterion

To obtain any of these values as a scalar, index into the property using dot notation. For example, in a model mdl, the AIC value aic is:

aic = mdl.ModelCriterion.AIC

NumCoefficients — Number of model coefficients (positive integer)

Number of model coefficients, stored as a positive integer. NumCoefficients includes coefficients that are set to zero when the model terms are rank deficient.

NumEstimatedCoefficients — Number of estimated coefficients (positive integer)

Number of estimated coefficients in the model, stored as a positive integer. NumEstimatedCoefficients does not include coefficients that are set to zero when the model terms are rank deficient. NumEstimatedCoefficients is the degrees of freedom for regression.

NumObservations — Number of observations (positive integer)

Number of observations the fitting function used in fitting, stored as a positive integer. This is the number of observations supplied in the original table, dataset, or matrix, minus any excluded rows (set with the Exclude name-value pair) or rows with missing values.

NumPredictors — Number of predictor variables (positive integer)

Number of predictor variables used to fit the model, stored as a positive integer.

NumVariables — Number of variables (positive integer)

Number of variables in the input data, stored as a positive integer. NumVariables is the number of variables in the original table or dataset, or the total number of columns in the predictor matrix and response vector when the fit is based on those arrays. It includes variables, if any, that are not used as predictors or as the response.

ObservationInfo — Observation information (table)

Observation information, stored as an n-by-4 table, where n is equal to the number of rows of input data. The four columns of ObservationInfo contain the following:

| Field | Description |
|---|---|
| Weights | Observation weights. Default is all 1. |
| Excluded | Logical value, 1 indicates an observation that you excluded from the fit with the Exclude name-value pair. |
| Missing | Logical value, 1 indicates a missing value in the input. Missing values are not used in the fit. |
| Subset | Logical value, 1 indicates the observation is not excluded or missing, so is used in the fit. |

ObservationNames — Observation names (cell array)

Observation names, stored as a cell array of strings containing the names of the observations used in the fit.

  • If the fit is based on a table or dataset containing observation names, ObservationNames uses those names.

  • Otherwise, ObservationNames is an empty cell array.

Offset — Offset variable (numeric vector)

Offset variable, stored as a numeric vector with the same length as the number of rows in the data. Offset is passed from fitglm or stepwiseglm in the Offset name-value pair. The fitting function used Offset as a predictor variable, but with the coefficient set to exactly 1. In other words, the formula for fitting was

f(μ) ~ Offset + (terms involving real predictors)

with the Offset predictor having coefficient 1.

For example, consider a Poisson regression model. Suppose the number of counts is known for theoretical reasons to be proportional to a predictor A. By using the log link function and by specifying log(A) as an offset, you can force the model to satisfy this theoretical constraint.
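The Poisson case described above might look like the following sketch. The exposure A, the predictor x, and the coefficients 0.2 and 0.5 are all made up for illustration.

```matlab
rng default
n = 100;
A = 1 + 9*rand(n,1);                      % hypothetical exposure, known in advance
x = randn(n,1);
y = poissrnd(A .* exp(0.2 + 0.5*x));      % counts proportional to A

% The log link plus Offset = log(A) forces the coefficient of log(A) to be 1
mdl = fitglm(x,y,'linear','Distribution','poisson','Offset',log(A));

offsetLength = numel(mdl.Offset);         % one entry per row of data
numCoef = mdl.NumCoefficients;            % intercept and slope only
```

No coefficient is estimated for log(A), so the Coefficients table lists only the intercept and the slope for x.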

PredictorNames — Names of predictors used to fit the model (cell array)

Names of predictors used to fit the model, stored as a cell array of strings.

Residuals — Residuals for fitted model (table)

Residuals for the fitted model, stored as a table with one row for each observation and the following columns.

| Field | Description |
|---|---|
| Raw | Observed minus fitted values. |
| LinearPredictor | Residuals on the linear predictor scale, equal to the adjusted response value minus the fitted linear combination of the predictors. |
| Pearson | Raw residuals divided by the estimated standard deviation of the response. |
| Anscombe | Residuals defined on transformed data with the transformation chosen to remove skewness. |
| Deviance | Residuals based on the contribution of each observation to the deviance. |

To obtain any of these columns as a vector, index into the property using dot notation. For example, in a model mdl, the ordinary raw residual vector r is:

r = mdl.Residuals.Raw

Rows not used in the fit because of missing values (in ObservationInfo.Missing) contain NaN values.

Rows not used in the fit because of excluded values (in ObservationInfo.Excluded) contain NaN values, with the following exceptions:

  • raw contains the difference between the observed and predicted values.

  • standardized is the residual, standardized in the usual way.

  • studentized matches the standardized values because this residual is not used in the estimate of the residual standard deviation.

ResponseName — Response variable name (string)

Response variable name, stored as a string.

Rsquared — R-squared value for the model (structure)

R-squared value for the model, stored as a structure.

For a linear or nonlinear model, Rsquared is a structure with two fields:

  • Ordinary — Ordinary (unadjusted) R-squared

  • Adjusted — R-squared adjusted for the number of coefficients

For a generalized linear model, Rsquared is a structure with five fields:

  • Ordinary — Ordinary (unadjusted) R-squared

  • Adjusted — R-squared adjusted for the number of coefficients

  • LLR — Log-likelihood ratio

  • Deviance — Deviance

  • AdjGeneralized — Adjusted generalized R-squared

The R-squared value is the proportion of total sum of squares explained by the model. The ordinary R-squared value relates to the SSR and SST properties:

Rsquared = SSR/SST = 1 - SSE/SST.

To obtain any of these values as a scalar, index into the property using dot notation. For example, the adjusted R-squared value in mdl is

r2 = mdl.Rsquared.Adjusted

SSE — Sum of squared errors (numeric value)

Sum of squared errors (residuals), stored as a numeric value.

The Pythagorean theorem implies

SST = SSE + SSR.

SSR — Regression sum of squares (numeric value)

Regression sum of squares, stored as a numeric value. The regression sum of squares is equal to the sum of squared deviations of the fitted values from their mean.

The Pythagorean theorem implies

SST = SSE + SSR.

SST — Total sum of squares (numeric value)

Total sum of squares, stored as a numeric value. The total sum of squares is equal to the sum of squared deviations of y from mean(y).

The Pythagorean theorem implies

SST = SSE + SSR.

Steps — Stepwise fitting information (structure)

Stepwise fitting information, stored as a structure with the following fields.

| Field | Description |
|---|---|
| Start | Formula representing the starting model |
| Lower | Formula representing the lower bound model, that is, the terms that must remain in the model |
| Upper | Formula representing the upper bound model; the model cannot contain more terms than Upper |
| Criterion | Criterion used for the stepwise algorithm, such as 'sse' |
| PEnter | Value of the parameter, such as 0.05 |
| PRemove | Value of the parameter, such as 0.10 |
| History | Table representing the steps taken in the fit |

The History table has one row for each step including the initial fit, and the following variables (columns).

| Field | Description |
|---|---|
| Action | Action taken during this step, one of: 'Start' (first step), 'Add' (a term is added), 'Remove' (a term is removed) |
| TermName | For a 'Start' step, the starting model specification. For 'Add' or 'Remove' steps, the term moved in that step |
| Terms | Terms matrix (see modelspec of fitlm) |
| DF | Regression degrees of freedom after this step |
| delDF | Change in regression degrees of freedom from previous step (negative for steps that remove a term) |
| Deviance | Deviance (residual sum of squares) at that step |
| FStat | F statistic that led to this step |
| PValue | p-value of the F statistic |

The structure is empty unless you use stepwiselm or stepwiseglm to fit the model.

VariableInfo — Information about input variables (table)

Information about input variables contained in Variables, stored as a table with one row for each model term and the following columns.

| Field | Description |
|---|---|
| Class | String giving variable class, such as 'double' |
| Range | Cell array giving variable range. For a continuous variable, a two-element vector [min,max], the minimum and maximum values. For a categorical variable, a cell array of distinct variable values |
| InModel | Logical vector, where true indicates the variable is in the model |
| IsCategorical | Logical vector, where true indicates a categorical variable |

VariableNames — Names of variables used in fit (cell array)

Names of variables used in fit, stored as a cell array of strings.

  • If the fit is based on a table or dataset, this property provides the names of the variables in that table or dataset.

  • If the fit is based on a predictor matrix and response vector, VariableNames is the values in the VarNames name-value pair of the fitting method.

  • Otherwise the variables have the default fitting names.

Variables — Data used to fit the model (table)

Data used to fit the model, stored as a table. Variables contains both observation and response values. If the fit is based on a table or dataset array, Variables contains all of the data from that table or dataset array. Otherwise, Variables is a table created from the input data matrix X and response vector y.

Methods

| Method | Description |
|---|---|
| addTerms | Add terms to generalized linear model |
| coefCI | Confidence intervals of coefficient estimates of generalized linear model |
| coefTest | Linear hypothesis test on generalized linear regression model coefficients |
| devianceTest | Analysis of deviance |
| disp | Display generalized linear regression model |
| feval | Evaluate generalized linear regression model prediction |
| fit | Create generalized linear regression model |
| plotDiagnostics | Plot diagnostics of generalized linear regression model |
| plotResiduals | Plot residuals of generalized linear regression model |
| plotSlice | Plot of slices through fitted generalized linear regression surface |
| predict | Predict response of generalized linear regression model |
| random | Simulate responses for generalized linear regression model |
| removeTerms | Remove terms from generalized linear model |
| step | Improve generalized linear regression model by adding or removing terms |
| stepwise | Create generalized linear regression model by stepwise regression |

Definitions

Canonical Link Function

The default link function for a generalized linear model is the canonical link function.

Canonical Link Functions for Generalized Linear Models

| Distribution | Link Function Name | Link Function | Mean (Inverse) Function |
|---|---|---|---|
| 'normal' | 'identity' | f(μ) = μ | μ = Xb |
| 'binomial' | 'logit' | f(μ) = log(μ/(1–μ)) | μ = exp(Xb) / (1 + exp(Xb)) |
| 'poisson' | 'log' | f(μ) = log(μ) | μ = exp(Xb) |
| 'gamma' | -1 | f(μ) = 1/μ | μ = 1/(Xb) |
| 'inverse gaussian' | -2 | f(μ) = 1/μ^2 | μ = (Xb)^(–1/2) |
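Each row of the table pairs a link f with the mean function that inverts it. The sketch below checks this numerically with anonymous functions written out by hand (they are not taken from any model object), using a few positive values of Xb chosen for illustration.

```matlab
Xb = linspace(0.5,2,4)';     % illustrative linear predictor values (positive)

% Link functions f(mu), in table order: normal, binomial, poisson, gamma, inverse gaussian
f = {@(mu) mu, @(mu) log(mu./(1-mu)), @(mu) log(mu), @(mu) 1./mu, @(mu) 1./mu.^2};

% Matching mean (inverse) functions mu = finv(Xb)
finv = {@(e) e, @(e) exp(e)./(1+exp(e)), @(e) exp(e), @(e) 1./e, @(e) e.^(-1/2)};

maxErr = zeros(numel(f),1);
for k = 1:numel(f)
    mu = finv{k}(Xb);                      % mean implied by the linear predictor
    maxErr(k) = max(abs(f{k}(mu) - Xb));   % applying the link should recover Xb
end
```

Positive values of Xb are used because the 'gamma' and 'inverse gaussian' mean functions are not defined for every real Xb.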

Hat Matrix

The hat matrix H is defined in terms of the data matrix X and a diagonal weight matrix W:

$$H = X (X^T W X)^{-1} X^T W^T.$$

W has diagonal elements $w_i$:

$$w_i = \frac{g'(\mu_i)}{\sqrt{V(\mu_i)}},$$

where

  • $g$ is the link function mapping $y_i$ to $x_i b$.

  • $g'$ is the derivative of the link function $g$.

  • $V$ is the variance function.

  • $\mu_i$ is the ith mean.

The diagonal elements $h_{ii}$ satisfy

$$0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p,$$

where n is the number of observations (rows of X), and p is the number of coefficients in the regression model.

Leverage

The leverage of observation i is the value of the ith diagonal term, hii, of the hat matrix H. Because the sum of the leverage values is p (the number of coefficients in the regression model), an observation i can be considered to be an outlier if its leverage substantially exceeds p/n, where n is the number of observations.

Cook's Distance

The Cook's distance $D_i$ of observation i is

$$D_i = w_i \, \frac{e_i^2}{p \hat{\varphi}} \, \frac{h_{ii}}{(1 - h_{ii})^2},$$

where

  • $\hat{\varphi}$ is the dispersion parameter (estimated or theoretical).

  • $e_i$ is the linear predictor residual, $g(y_i) - x_i \hat{\beta}$, where

    • $g$ is the link function.

    • $y_i$ is the observed response.

    • $x_i$ is the observation.

    • $\hat{\beta}$ is the estimated coefficient vector.

  • $p$ is the number of coefficients in the regression model.

  • $h_{ii}$ is the ith diagonal element of the Hat Matrix H.

Deviance

Deviance of a model M1 is twice the difference between the loglikelihood of the saturated model, MS, and the loglikelihood of M1. The saturated model is the model with the maximum number of parameters that can be estimated. For example, if there are n observations $y_i$, i = 1, 2, ..., n, with potentially different values for $X_i^T \beta$, then you can define a saturated model with n parameters. Let L(b,y) denote the maximum value of the likelihood function for a model. Then the deviance of model M1 is

$$-2 (\log L(b_1,y) - \log L(b_S,y)),$$

where b1 are the estimated parameters for model M1 and bS are the estimated parameters for the saturated model. The deviance has a chi-square distribution with n − p degrees of freedom, where n is the number of parameters in the saturated model and p is the number of parameters in model M1.

If M1 and M2 are two different generalized linear models, then the fit of the models can be assessed by comparing the deviances D1 and D2 of these models. The difference of the deviances is

$$D = D_2 - D_1 = -2 (\log L(b_2,y) - \log L(b_S,y)) + 2 (\log L(b_1,y) - \log L(b_S,y)) = -2 (\log L(b_2,y) - \log L(b_1,y)).$$

Asymptotically, this difference has a chi-square distribution with degrees of freedom v equal to the number of parameters that are estimated in one model but fixed (typically at 0) in the other. That is, it is equal to the difference in the number of parameters estimated in M1 and M2. You can get the p-value for this test using 1 - chi2cdf(D,v), where D = D2 − D1.
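A sketch of this comparison on simulated data follows. It takes M1 to be the larger model and M2 to be the nested model with two coefficients fixed at 0, so that D is nonnegative; the data and variable names are assumptions for the example.

```matlab
rng default
X = randn(100,3);
y = poissrnd(exp(1 + 0.4*X(:,1)));        % only the first predictor matters

mdlBig   = fitglm(X,y,'linear','Distribution','poisson');        % M1: all three predictors
mdlSmall = fitglm(X(:,1),y,'linear','Distribution','poisson');   % M2: last two slopes fixed at 0

D = mdlSmall.Deviance - mdlBig.Deviance;  % D = D2 - D1
v = mdlBig.NumEstimatedCoefficients - mdlSmall.NumEstimatedCoefficients;
pValue = 1 - chi2cdf(D,v);

% The same difference computed from the log likelihoods
Dcheck = -2*(mdlSmall.LogLikelihood - mdlBig.LogLikelihood);
```

A large pValue suggests that the extra terms in the larger model do not improve the fit significantly.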

Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® documentation.

Examples


Fit a Generalized Linear Model

Fit a logistic regression model of probability of smoking as a function of age, weight, and sex, using a two-way interactions model.

Load the hospital dataset array.

load hospital
ds = hospital; % just to use the ds name

Specify the model using a formula that allows up to two-way interactions.

modelspec = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex';

Create the generalized linear model.

mdl = fitglm(ds,modelspec,'Distribution','binomial')
mdl = 


Generalized Linear regression model:
    logit(Smoker) ~ 1 + Sex*Age + Sex*Weight + Age*Weight
    Distribution = Binomial

Estimated Coefficients:
                        Estimate         SE         tStat      pValue 
                       ___________    _________    ________    _______

    (Intercept)            -6.0492       19.749     -0.3063    0.75938
    Sex_Male               -2.2859       12.424    -0.18399    0.85402
    Age                    0.11691      0.50977     0.22934    0.81861
    Weight                0.031109      0.15208     0.20455    0.83792
    Sex_Male:Age          0.020734      0.20681     0.10025    0.92014
    Sex_Male:Weight        0.01216     0.053168     0.22871     0.8191
    Age:Weight         -0.00071959    0.0038964    -0.18468    0.85348


100 observations, 93 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 5.07, p-value = 0.535

The large $p$-value indicates the model might not differ statistically from a constant.

Create a Generalized Linear Model Stepwise

Create response data using just three of 20 predictors, and create a generalized linear model stepwise to see if it uses just the correct predictors.

Create data with 20 predictors, and Poisson response using just three of the predictors, plus a constant.

rng default % for reproducibility
X = randn(100,20);
mu = exp(X(:,[5 10 15])*[.4;.2;.3] + 1);
y = poissrnd(mu);

Fit a generalized linear model using the Poisson distribution.

mdl =  stepwiseglm(X,y,...
    'constant','upper','linear','Distribution','poisson')
1. Adding x5, Deviance = 134.439, Chi2Stat = 52.24814, PValue = 4.891229e-13
2. Adding x15, Deviance = 106.285, Chi2Stat = 28.15393, PValue = 1.1204e-07
3. Adding x10, Deviance = 95.0207, Chi2Stat = 11.2644, PValue = 0.000790094

mdl = 


Generalized Linear regression model:
    log(y) ~ 1 + x5 + x10 + x15
    Distribution = Poisson

Estimated Coefficients:
                   Estimate       SE       tStat       pValue  
                   ________    ________    ______    __________

    (Intercept)     1.0115     0.064275    15.737    8.4217e-56
    x5             0.39508     0.066665    5.9263    3.0977e-09
    x10            0.18863      0.05534    3.4085     0.0006532
    x15            0.29295     0.053269    5.4995    3.8089e-08


100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 91.7, p-value = 9.61e-20
