Superclasses: CompactGeneralizedLinearModel
Generalized linear regression model class
An object comprising training data, model description, diagnostic information, and fitted coefficients for a generalized linear regression. Predict model responses with the predict or feval methods.
mdl = fitglm(tbl) or mdl = fitglm(X,y) creates a generalized linear model of a table or dataset array tbl, or of the responses y to a data matrix X. For details, see fitglm.
mdl = stepwiseglm(tbl) or mdl = stepwiseglm(X,y) creates a generalized linear model of a table or dataset array tbl, or of the responses y to a data matrix X, with unimportant predictors excluded. For details, see stepwiseglm.
tbl — Input data
Input data, specified as a table or dataset array. When modelspec is a formula, it specifies the variables to be used as the predictors and response. Otherwise, if you do not specify the predictor and response variables, the last variable is the response variable and the others are the predictor variables by default.
Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.
To set a different column as the response variable, use the ResponseVar name-value pair argument. To use a subset of the columns as predictors, use the PredictorVars name-value pair argument.
X — Predictor variables
Predictor variables, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.
By default, there is a constant term in the model, unless you explicitly remove it, so do not include a column of 1s in X.
Data Types: single | double | logical
y — Response variable
Response variable, specified as an n-by-1 vector, where n is the number of observations. Each entry in y is the response for the corresponding row of X.
Data Types: single | double | logical
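As a minimal sketch of the matrix syntax, the following simulates Poisson counts and fits a model from X and y. The simulated data, the coefficient values, and the choice of the Poisson distribution are illustrative assumptions. Note that X has no column of 1s, because the constant term is added by default.

```matlab
% Illustrative data: 100 observations, 2 predictors (assumed values).
rng default                              % for reproducibility
X  = randn(100,2);                       % n-by-p predictor matrix, no column of 1s
mu = exp(1 + X*[0.4; -0.2]);             % mean on the response scale (log link)
y  = poissrnd(mu);                       % n-by-1 response vector

% Fit a model with an intercept and one linear term per column of X.
mdl = fitglm(X,y,'linear','Distribution','poisson');
disp(mdl.NumCoefficients)                % intercept plus two slopes
```

The intercept counts as a coefficient, so a fit on two predictor columns reports three coefficients.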
CoefficientCovariance — Covariance matrix of coefficient estimates
Covariance matrix of coefficient estimates, stored as a p-by-p matrix of numeric values. p is the number of coefficients in the fitted model.
CoefficientNames — Coefficient names
Coefficient names, stored as a cell array of character vectors containing a label for each coefficient.
Coefficients — Coefficient values
Coefficient values, stored as a table. Coefficients has one row for each coefficient and the following columns:
Estimate — Estimated coefficient value
SE — Standard error of the estimate
tStat — t statistic for a test that the coefficient is zero
pValue — p-value for the t statistic
To obtain any of these columns as a vector, index into the property using dot notation. For example, in mdl the estimated coefficient vector is
beta = mdl.Coefficients.Estimate
Use coefTest to perform other tests on the coefficients.
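The following sketch shows one way to read the coefficient table and run a joint test with coefTest. The simulated data and the contrast matrix H are assumptions chosen for illustration; H selects the coefficients of the first and third predictors.

```matlab
% Illustrative Poisson data with three predictors (assumed values).
rng default
X = randn(100,3);
y = poissrnd(exp(1 + X*[0.4; 0; 0.3]));
mdl = fitglm(X,y,'linear','Distribution','poisson');

% Columns of the Coefficients table as vectors.
beta = mdl.Coefficients.Estimate;        % 4-by-1: intercept, x1, x2, x3
se   = mdl.Coefficients.SE;

% Joint test that the coefficients of x1 and x3 are both zero.
% Each row of H picks out one coefficient.
H = [0 1 0 0;
     0 0 0 1];
p = coefTest(mdl,H);                     % p-value of the joint test
```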
Deviance — Deviance of the fit
Deviance of the fit, stored as a numeric value. Deviance is useful for comparing two models when one is a special case of the other. The difference between the deviance of the two models has a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models. For more information on deviance, see Deviance.
DFE — Degrees of freedom for error
Degrees of freedom for error (residuals), equal to the number of observations minus the number of estimated coefficients, stored as a positive integer value.
Diagnostics — Diagnostic information
Diagnostic information for the model, stored as a table. Diagnostics can help identify outliers and influential observations. Diagnostics contains the following fields:
Field  Meaning  Utility

Leverage  Diagonal elements of HatMatrix  Leverage indicates to what extent the predicted value for an observation is determined by the observed value for that observation. A value close to 1 indicates that the prediction is largely determined by that observation, with little contribution from the other observations. A value close to 0 indicates the fit is largely determined by the other observations. For a model with p coefficients and n observations, the average value of Leverage is p/n. An observation with Leverage larger than 2*p/n can be an outlier.
CooksDistance  Cook's measure of scaled change in fitted values  CooksDistance is a measure of scaled change in fitted values. An observation with CooksDistance larger than three times the mean Cook's distance can be an outlier.
HatMatrix  Projection matrix to compute fitted from observed responses  HatMatrix is an n-by-n matrix such that Fitted = HatMatrix*Y, where Y is the response vector and Fitted is the vector of fitted response values.
All of these quantities are computed on the scale of the linear predictor. So, for example, in the equation that defines the hat matrix,
Yfit = glm.Fitted.LinearPredictor
Y = glm.Fitted.LinearPredictor + glm.Residuals.LinearPredictor
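The rules of thumb in the table can be applied directly to the Diagnostics table. This is a sketch on simulated data (an assumption for illustration); the variable names highLeverage and highCook are hypothetical, and the thresholds are the ones stated above, not a definitive outlier test.

```matlab
% Illustrative Poisson fit (assumed data).
rng default
X = randn(100,2);
y = poissrnd(exp(1 + X*[0.4; -0.2]));
mdl = fitglm(X,y,'linear','Distribution','poisson');

lev  = mdl.Diagnostics.Leverage;         % one value per observation
cook = mdl.Diagnostics.CooksDistance;
p = mdl.NumEstimatedCoefficients;
n = mdl.NumObservations;

% Flag observations using the thresholds described in the table.
highLeverage = find(lev > 2*p/n);
highCook     = find(cook > 3*mean(cook,'omitnan'));
```

Because the leverage values sum to p, their average is p/n, which is why 2*p/n serves as a natural reference level.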
Dispersion — Scale factor of the variance of the response
Scale factor of the variance of the response, stored as a structure. Dispersion multiplies the variance function for the distribution.
For example, the variance function for the binomial distribution is p(1–p)/n, where p is the probability parameter and n is the sample size parameter. If Dispersion is near 1, the variance of the data appears to agree with the theoretical variance of the binomial distribution. If Dispersion is larger than 1, the data are “overdispersed” relative to the binomial distribution.
DispersionEstimated — Flag to indicate use of dispersion scale factor
Flag to indicate whether fitglm used the Dispersion scale factor to compute standard errors for the coefficients in Coefficients.SE, stored as a logical value. If DispersionEstimated is false, fitglm used the theoretical value of the variance.
DispersionEstimated can be false only for 'binomial' or 'poisson' distributions.
Set DispersionEstimated by setting the DispersionFlag name-value pair in fitglm.
Distribution — Generalized distribution information
Generalized distribution information, stored as a structure with the following fields relating to the generalized distribution:
Field  Description

Name  Name of the distribution, one of 'normal', 'binomial', 'poisson', 'gamma', or 'inverse gaussian'.
DevianceFunction  Function that computes the components of the deviance as a function of the fitted parameter values and the response values.
VarianceFunction  Function that computes the theoretical variance for the distribution as a function of the fitted parameter values. When DispersionEstimated is true, Dispersion multiplies the variance function in the computation of the coefficient standard errors.
Fitted — Fitted response values based on input data
Fitted (predicted) values based on the input data, stored as a table with one row for each observation and the following columns.
Field  Description

Response  Predicted values on the scale of the response.
LinearPredictor  Predicted values on the scale of the linear predictor. These are the same as the link function applied to the Response fitted values.
Probability  Fitted probabilities (this column is included only with the binomial distribution).
To obtain any of the columns as a vector, index into the property using dot notation. For example, in the model mdl, the vector f of fitted values on the response scale is
f = mdl.Fitted.Response
Use predict to compute predictions for other predictor values, or to compute confidence bounds on Fitted.
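As a sketch of how the two fitted scales relate and how predict extends them to new data, the following uses a simulated Poisson model with the default log link. The data and the new predictor rows in Xnew are assumptions for illustration.

```matlab
% Illustrative Poisson fit with the default (log) link.
rng default
X = randn(100,2);
y = poissrnd(exp(1 + X*[0.4; -0.2]));
mdl = fitglm(X,y,'linear','Distribution','poisson');

% With a log link, the linear predictor is the log of the fitted response.
f   = mdl.Fitted.Response;
eta = mdl.Fitted.LinearPredictor;
gap = max(abs(log(f) - eta));            % should be numerically zero

% Predictions and confidence bounds at new predictor values.
Xnew = [0 0; 1 -1];                      % two hypothetical observations
[ypred,yci] = predict(mdl,Xnew);         % yci has one [lower upper] row per observation
```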
Formula — Model information
LinearFormula object | NonLinearFormula object
Model information, stored as a LinearFormula object or NonLinearFormula object. If you fit a linear or generalized linear regression model, then Formula is a LinearFormula object. If you fit a nonlinear regression model, then Formula is a NonLinearFormula object.
Link — Link function
Link function, stored as a structure with the following fields:
Field  Description

Name  Name of the link function, or '' if you specified the link as a function handle rather than a character vector.
LinkFunction  The function that defines f, a function handle.
DevianceFunction  Derivative of f, a function handle.
VarianceFunction  Inverse of f, a function handle.
The link is a function f that links the distribution parameter μ to the fitted linear combination Xb of the predictors:
f(μ) = Xb.
LogLikelihood — Log likelihood
Log likelihood of the model distribution at the response values, stored as a numeric value. The mean is fitted from the model, and other parameters are estimated as part of the model fit.
ModelCriterion — Criterion for model comparison
Criterion for model comparison, stored as a structure with the following fields:
AIC — Akaike information criterion
AICc — Akaike information criterion corrected for sample size
BIC — Bayesian information criterion
CAIC — Consistent Akaike information criterion
To obtain any of these values as a scalar, index into the property using dot notation. For example, in a model mdl, the AIC value aic is:
aic = mdl.ModelCriterion.AIC
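One possible use of these criteria is to compare candidate models fitted to the same data, preferring the smaller value. The sketch below compares an intercept-only model with a linear model on simulated data; the data and the names mdl0, mdl1, and best are assumptions for illustration.

```matlab
% Illustrative Poisson data (assumed values).
rng default
X = randn(100,2);
y = poissrnd(exp(1 + X*[0.4; -0.2]));

% Two candidate models fitted to the same responses.
mdl0 = fitglm(X,y,'constant','Distribution','poisson');
mdl1 = fitglm(X,y,'linear','Distribution','poisson');

% Collect the AIC values; the smaller value indicates the preferred model.
aic = [mdl0.ModelCriterion.AIC, mdl1.ModelCriterion.AIC];
[~,best] = min(aic);                     % 1 = constant model, 2 = linear model
```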
NumCoefficients — Number of model coefficients
Number of model coefficients, stored as a positive integer. NumCoefficients includes coefficients that are set to zero when the model terms are rank deficient.
NumEstimatedCoefficients — Number of estimated coefficients
Number of estimated coefficients in the model, stored as a positive integer. NumEstimatedCoefficients does not include coefficients that are set to zero when the model terms are rank deficient. NumEstimatedCoefficients is the degrees of freedom for regression.
NumObservations — Number of observations
Number of observations the fitting function used in fitting, stored as a positive integer. This is the number of observations supplied in the original table, dataset, or matrix, minus any excluded rows (set with the Exclude name-value pair) or rows with missing values.
NumPredictors — Number of predictor variables
Number of predictor variables used to fit the model, stored as a positive integer.
NumVariables — Number of variables
Number of variables in the input data, stored as a positive integer. NumVariables is the number of variables in the original table or dataset, or the total number of columns in the predictor matrix and response vector when the fit is based on those arrays. It includes variables, if any, that are not used as predictors or as the response.
ObservationInfo — Observation information
Observation information, stored as an n-by-4 table, where n is equal to the number of rows of input data. The four columns of ObservationInfo contain the following:
Field  Description

Weights  Observation weights. Default is all 1.
Excluded  Logical value, 1 indicates an observation that you excluded from the fit with the Exclude name-value pair.
Missing  Logical value, 1 indicates a missing value in the input. Missing values are not used in the fit.
Subset  Logical value, 1 indicates the observation is not excluded or missing, so is used in the fit.
ObservationNames — Observation names
Observation names, stored as a cell array of character vectors containing the names of the observations used in the fit.
If the fit is based on a table or dataset containing observation names, ObservationNames uses those names.
Otherwise, ObservationNames is an empty cell array.
Offset — Offset variable
Offset variable, stored as a numeric vector with the same length as the number of rows in the data. Offset is passed from fitglm or stepwiseglm in the Offset name-value pair. The fitting function used Offset as a predictor variable, but with the coefficient set to exactly 1. In other words, the formula for fitting was
f(μ) ~ Offset + (terms involving real predictors)
with the Offset predictor having coefficient 1.
For example, consider a Poisson regression model. Suppose the number of counts is known for theoretical reasons to be proportional to a predictor A. By using the log link function and by specifying log(A) as an offset, you can force the model to satisfy this theoretical constraint.
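The Poisson case described above can be sketched as follows. The exposure variable A, the predictor x, and the coefficient values are assumptions made for the simulation; the point is only that log(A) enters the linear predictor with a fixed coefficient of 1 rather than an estimated one.

```matlab
% Simulated counts that are proportional to a known exposure A (assumed data).
rng default
A = 1 + 9*rand(200,1);                   % exposure, strictly positive
x = randn(200,1);                        % one ordinary predictor
y = poissrnd(A .* exp(0.5 + 0.3*x));     % mean is A times exp(linear terms)

% Pass log(A) as the offset; only the intercept and the slope of x are estimated.
mdl = fitglm(x,y,'linear','Distribution','poisson','Offset',log(A));
disp(mdl.Coefficients.Estimate')         % estimates of the intercept and slope
```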
PredictorNames — Names of predictors used to fit the model
Names of predictors used to fit the model, stored as a cell array of character vectors.
Residuals — Residuals for fitted model
Residuals for the fitted model, stored as a table with one row for each observation and the following columns.
Field  Description

Raw  Observed minus fitted values.
LinearPredictor  Residuals on the linear predictor scale, equal to the adjusted response value minus the fitted linear combination of the predictors.
Pearson  Raw residuals divided by the estimated standard deviation of the response.
Anscombe  Residuals defined on transformed data with the transformation chosen to remove skewness.
Deviance  Residuals based on the contribution of each observation to the deviance.
To obtain any of these columns as a vector, index into the property using dot notation. For example, in a model mdl, the ordinary raw residual vector r is:
r = mdl.Residuals.Raw
Rows not used in the fit because of missing values (in ObservationInfo.Missing) contain NaN values.
Rows not used in the fit because of excluded values (in ObservationInfo.Excluded) contain NaN values, with the following exceptions:
raw contains the difference between the observed and predicted values.
standardized is the residual, standardized in the usual way.
studentized matches the standardized values because this residual is not used in the estimate of the residual standard deviation.
ResponseName — Response variable name
Response variable name, stored as a character vector.
Rsquared — R-squared value for the model
R-squared value for the model, stored as a structure.
For a linear or nonlinear model, Rsquared is a structure with two fields:
Ordinary — Ordinary (unadjusted) R-squared
Adjusted — R-squared adjusted for the number of coefficients
For a generalized linear model, Rsquared is a structure with five fields:
Ordinary — Ordinary (unadjusted) R-squared
Adjusted — R-squared adjusted for the number of coefficients
LLR — Log-likelihood ratio
Deviance — Deviance
AdjGeneralized — Adjusted generalized R-squared
The R-squared value is the proportion of total sum of squares explained by the model. The ordinary R-squared value relates to the SSR and SST properties:
Rsquared = SSR/SST = 1 - SSE/SST.
To obtain any of these values as a scalar, index into the property using dot notation. For example, the adjusted R-squared value in mdl is
r2 = mdl.Rsquared.Adjusted
SSE — Sum of squared errors
Sum of squared errors (residuals), stored as a numeric value.
The Pythagorean theorem implies
SST = SSE + SSR.
SSR — Regression sum of squares
Regression sum of squares, stored as a numeric value. The regression sum of squares is equal to the sum of squared deviations of the fitted values from their mean.
The Pythagorean theorem implies
SST = SSE + SSR.
SST — Total sum of squares
Total sum of squares, stored as a numeric value. The total sum of squares is equal to the sum of squared deviations of y from mean(y).
The Pythagorean theorem implies
SST = SSE + SSR.
Steps — Stepwise fitting information
Stepwise fitting information, stored as a structure with the following fields.
Field  Description

Start  Formula representing the starting model
Lower  Formula representing the lower bound model; these terms must remain in the model
Upper  Formula representing the upper bound model; the model cannot contain more terms than Upper
Criterion  Criterion used for the stepwise algorithm, such as 'sse'
PEnter  Value of the parameter, such as 0.05
PRemove  Value of the parameter, such as 0.10
History  Table representing the steps taken in the fit
The History table has one row for each step including the initial fit, and the following variables (columns).
Field  Description 

Action  Action taken during this step, one of:

TermName 

Terms  Terms matrix (see modelspec of fitlm)
DF  Regression degrees of freedom after this step
delDF  Change in regression degrees of freedom from previous step (negative for steps that remove a term)
Deviance  Deviance (residual sum of squares) at that step
FStat  F statistic that led to this step
PValue  p-value of the F statistic
The structure is empty unless you use stepwiselm or stepwiseglm to fit the model.
VariableInfo — Information about input variables
Information about input variables contained in Variables, stored as a table with one row for each model term and the following columns.
Field  Description 

Class  Character vector giving variable class, such as 'double' 
Range  Cell array giving variable range:

InModel  Logical vector, where true indicates the variable is in the model
IsCategorical  Logical vector, where true indicates a categorical variable
VariableNames — Names of variables used in fit
Names of variables used in fit, stored as a cell array of character vectors.
If the fit is based on a table or dataset, this property provides the names of the variables in that table or dataset.
If the fit is based on a predictor matrix and response vector, VariableNames is the values in the VarNames name-value pair of the fitting method.
Otherwise the variables have the default fitting names.
Variables — Data used to fit the model
Data used to fit the model, stored as a table. Variables contains both observation and response values. If the fit is based on a table or dataset array, Variables contains all of the data from that table or dataset array. Otherwise, Variables is a table created from the input data matrix X and response vector y.
addTerms  Add terms to generalized linear model 
compact  Compact generalized linear regression model 
fit  Create generalized linear regression model 
plotDiagnostics  Plot diagnostics of generalized linear regression model 
plotResiduals  Plot residuals of generalized linear regression model 
removeTerms  Remove terms from generalized linear model 
step  Improve generalized linear regression model by adding or removing terms 
stepwise  Create generalized linear regression model by stepwise regression 
coefCI  Confidence intervals of coefficient estimates of generalized linear model 
coefTest  Linear hypothesis test on generalized linear regression model coefficients 
devianceTest  Analysis of deviance 
disp  Display generalized linear regression model 
feval  Evaluate generalized linear regression model prediction 
plotSlice  Plot of slices through fitted generalized linear regression surface 
predict  Predict response of generalized linear regression model 
random  Simulate responses for generalized linear regression model 
Value. To learn how value classes affect copy operations, see Copying Objects (MATLAB).
Fit a logistic regression model of probability of smoking as a function of age, weight, and sex, using a two-way interactions model.
Load the hospital dataset array.
load hospital
ds = hospital; % just to use the ds name
Specify the model using a formula that allows up to two-way interactions.
modelspec = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex';
Create the generalized linear model.
mdl = fitglm(ds,modelspec,'Distribution','binomial')
mdl =
Generalized linear regression model:
    logit(Smoker) ~ 1 + Sex*Age + Sex*Weight + Age*Weight
    Distribution = Binomial

Estimated Coefficients:
                        Estimate         SE         tStat      pValue
                       ___________    _________    _______    _______
    (Intercept)             6.0492       19.749     0.3063    0.75938
    Sex_Male                2.2859       12.424    0.18399    0.85402
    Age                    0.11691      0.50977    0.22934    0.81861
    Weight                0.031109      0.15208    0.20455    0.83792
    Sex_Male:Age          0.020734      0.20681    0.10025    0.92014
    Sex_Male:Weight        0.01216     0.053168    0.22871     0.8191
    Age:Weight          0.00071959    0.0038964    0.18468    0.85348

100 observations, 93 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 5.07, p-value = 0.535
The large p-value indicates the model might not differ statistically from a constant.
Create response data using just three of 20 predictors, and create a generalized linear model stepwise to see if it uses just the correct predictors.
Create data with 20 predictors, and Poisson response using just three of the predictors, plus a constant.
rng default % for reproducibility
X = randn(100,20);
mu = exp(X(:,[5 10 15])*[.4;.2;.3] + 1);
y = poissrnd(mu);
Fit a generalized linear model using the Poisson distribution.
mdl = stepwiseglm(X,y,...
    'constant','upper','linear','Distribution','poisson')
1. Adding x5, Deviance = 134.439, Chi2Stat = 52.24814, PValue = 4.891229e-13
2. Adding x15, Deviance = 106.285, Chi2Stat = 28.15393, PValue = 1.1204e-07
3. Adding x10, Deviance = 95.0207, Chi2Stat = 11.2644, PValue = 0.000790094
mdl =
Generalized linear regression model:
    log(y) ~ 1 + x5 + x10 + x15
    Distribution = Poisson

Estimated Coefficients:
                   Estimate       SE        tStat       pValue
                   ________    ________    ______    __________
    (Intercept)     1.0115     0.064275    15.737    8.4217e-56
    x5             0.39508     0.066665    5.9263    3.0977e-09
    x10            0.18863      0.05534    3.4085     0.0006532
    x15            0.29295     0.053269    5.4995    3.8089e-08

100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 91.7, p-value = 9.61e-20
The default link function for a generalized linear model is the canonical link function.
Canonical Link Functions for Generalized Linear Models
Distribution  Link Function Name  Link Function  Mean (Inverse) Function

'normal'  'identity'  f(μ) = μ  μ = Xb
'binomial'  'logit'  f(μ) = log(μ/(1–μ))  μ = exp(Xb) / (1 + exp(Xb))
'poisson'  'log'  f(μ) = log(μ)  μ = exp(Xb)
'gamma'  -1  f(μ) = 1/μ  μ = 1/(Xb)
'inverse gaussian'  -2  f(μ) = 1/μ^{2}  μ = (Xb)^{–1/2}
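To make the relationship between the link and mean columns concrete, here is a small sketch of the logit row written as function handles. The names link and inverse and the test values in mu are assumptions for illustration; applying the inverse after the link should return the original means.

```matlab
% Logit link and its inverse, written out from the table row for 'binomial'.
link    = @(mu)  log(mu./(1 - mu));      % f(mu)
inverse = @(eta) exp(eta)./(1 + exp(eta)); % mu as a function of the linear predictor

mu     = [0.1 0.5 0.9];                  % example means in (0,1)
eta    = link(mu);                       % linear-predictor scale
muBack = inverse(eta);                   % back on the response scale
```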
The hat matrix H is defined in terms of the data matrix X and a diagonal weight matrix W:
H = X(X^{T}WX)^{–1}X^{T}W^{T}.
W has diagonal elements w_{i}:
$${w}_{i}=\frac{{g}^{\prime}\left({\mu}_{i}\right)}{\sqrt{V\left({\mu}_{i}\right)}},$$
where
g is the link function mapping y_{i} to x_{i}b.
$${g}^{\prime}$$ is the derivative of the link function g.
V is the variance function.
μ_{i} is the ith mean.
The diagonal elements h_{ii} satisfy
$$\begin{array}{l}0\le {h}_{ii}\le 1\\ {\displaystyle \sum _{i=1}^{n}{h}_{ii}}=p,\end{array}$$
where n is the number of observations (rows of X), and p is the number of coefficients in the regression model.
The leverage of observation i is the value of the ith diagonal term, h_{ii}, of the hat matrix H. Because the sum of the leverage values is p (the number of coefficients in the regression model), an observation i can be considered to be an outlier if its leverage substantially exceeds p/n, where n is the number of observations.
The Cook’s distance D_{i} of observation i is
$${D}_{i}={w}_{i}\frac{{e}_{i}^{2}}{p\widehat{\phi}}\frac{{h}_{ii}}{{\left(1-{h}_{ii}\right)}^{2}},$$
where
$$\widehat{\phi}$$ is the dispersion parameter (estimated or theoretical).
e_{i} is the linear predictor residual, $$g\left({y}_{i}\right)-{x}_{i}\widehat{\beta}$$, where
g is the link function.
y_{i} is the observed response.
x_{i} is the observation.
$$\widehat{\beta}$$ is the estimated coefficient vector.
p is the number of coefficients in the regression model.
h_{ii} is the ith diagonal element of the Hat Matrix H.
Deviance of a model M_{1} is twice the difference between the log-likelihood of that model and the saturated model, M_{S}. The saturated model is the model with the maximum number of parameters that can be estimated. For example, if there are n observations y_{i}, i = 1, 2, ..., n, with potentially different values for X_{i}^{T}β, then you can define a saturated model with n parameters. Let L(b,y) denote the maximum value of the likelihood function for a model. Then the deviance of model M_{1} is
$$-2\left(\mathrm{log}L\left({b}_{1},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right),$$
where b_{1} are the estimated parameters for model M_{1} and b_{S} are the estimated parameters for the saturated model. The deviance has a chi-square distribution with n – p degrees of freedom, where n is the number of parameters in the saturated model and p is the number of parameters in model M_{1}.
If M_{1} and M_{2} are two different generalized linear models, then the fit of the models can be assessed by comparing the deviances D_{1} and D_{2} of these models. The difference of the deviances is
$$\begin{array}{l}D={D}_{2}-{D}_{1}=-2\left(\mathrm{log}L\left({b}_{2},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right)+2\left(\mathrm{log}L\left({b}_{1},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right)\\ \phantom{D={D}_{2}-{D}_{1}}=-2\left(\mathrm{log}L\left({b}_{2},y\right)-\mathrm{log}L\left({b}_{1},y\right)\right).\end{array}$$
Asymptotically, this difference has a chi-square distribution with degrees of freedom v equal to the number of parameters that are estimated in one model but fixed (typically at 0) in the other. That is, it is equal to the difference in the number of parameters estimated in M_{1} and M_{2}.
You can get the p-value for this test using 1 - chi2cdf(D,v), where D = D_{2} – D_{1}.
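The deviance comparison can be sketched on simulated data as follows. This example assumes that the model with fewer terms plays the role of M_{2}, so that D is nonnegative; the data, the formulas, and the names small and big are illustrative assumptions. The formulas rely on the default variable names x1, x2, x3, and y that are used for matrix input.

```matlab
% Illustrative Poisson data (assumed values); x2 has no true effect.
rng default
X = randn(100,3);
y = poissrnd(exp(1 + X*[0.4; 0; 0.3]));

% Nested models: the small model fixes the x2 and x3 coefficients at 0.
small = fitglm(X,y,'y ~ x1','Distribution','poisson');
big   = fitglm(X,y,'y ~ x1 + x2 + x3','Distribution','poisson');

% Difference of deviances and its degrees of freedom.
D = small.Deviance - big.Deviance;
v = big.NumEstimatedCoefficients - small.NumEstimatedCoefficients;
pval = 1 - chi2cdf(D,v);                 % small values favor the larger model
```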
Usage notes and limitations:
Only the predict and random functions support code generation.
When you fit a model by using fitglm or stepwiseglm, the following restrictions apply.
Code generation does not support categorical predictors. You cannot supply training data in a table that contains at least one categorical predictor, and you cannot use the 'CategoricalVars' name-value pair argument. To dummy-code variables that you want treated as categorical, preprocess the categorical data by using dummyvar before fitting the model.
The Link, Derivative, and Inverse fields of the 'Link' name-value pair argument cannot be anonymous functions. That is, you cannot generate code using a generalized linear model that was created using anonymous functions for links. Instead, declare functions for link components.
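The dummy-coding step mentioned above might look like the following sketch. The grouping variable group and the choice to drop the first indicator column are assumptions for illustration; dropping one column avoids redundancy with the constant term that the fitting functions add by default.

```matlab
% A hypothetical categorical predictor with three levels.
group = categorical({'a';'b';'c';'a';'b'});

% One indicator column per category.
D = dummyvar(group);                     % 5-by-3 matrix of 0s and 1s

% Keep all but the first column so the default intercept represents level 'a'.
Xcat = D(:,2:end);                       % numeric columns usable as predictors
```

The resulting numeric columns can then be concatenated with the other predictors in the matrix passed to the fitting function.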
LinearModel | NonLinearModel | fitglm | plotPartialDependence | stepwiseglm