Create generalized linear regression model
Make a logistic binomial model of the probability of smoking as a function of age, weight, and sex, using a two-way interactions model.
Load the hospital dataset array.
load hospital
dsa = hospital;
Specify the model using a formula that allows up to two-way interactions between the variables age, weight, and sex. Smoker is the response variable.
modelspec = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex';
Fit a logistic binomial model.
mdl = fitglm(dsa,modelspec,'Distribution','binomial')
mdl = 
Generalized linear regression model:
    logit(Smoker) ~ 1 + Sex*Age + Sex*Weight + Age*Weight
    Distribution = Binomial

Estimated Coefficients:
                        Estimate          SE         tStat      pValue 
                       ___________    _________    ________    _______
    (Intercept)            -6.0492       19.749     -0.3063    0.75938
    Sex_Male               -2.2859       12.424    -0.18399    0.85402
    Age                    0.11691      0.50977     0.22934    0.81861
    Weight                0.031109      0.15208     0.20455    0.83792
    Sex_Male:Age          0.020734      0.20681     0.10025    0.92014
    Sex_Male:Weight        0.01216     0.053168     0.22871     0.8191
    Age:Weight         -0.00071959    0.0038964    -0.18468    0.85348

100 observations, 93 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 5.07, p-value = 0.535
All of the p-values (under pValue) are large, so none of the coefficients is statistically significant. The large p-value for the test of the model, 0.535, indicates that this model might not differ statistically from a constant model.
Create sample data with 7 predictors, and a Poisson response that uses just three of the predictors, plus a constant.
rng('default') % for reproducibility
X = randn(100,7);
mu = exp(X(:,[1 3 6])*[.4;.2;.3] + 1);
y = poissrnd(mu);
Fit a generalized linear model using the Poisson distribution.
mdl = fitglm(X,y,'linear','Distribution','poisson')
mdl = 
Generalized linear regression model:
    log(y) ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7
    Distribution = Poisson

Estimated Coefficients:
                   Estimate        SE         tStat        pValue  
                   _________    ________    ________    __________
    (Intercept)      0.88723    0.070969      12.502    7.3149e-36
    x1               0.44413    0.052337      8.4858    2.1416e-17
    x2             0.0083388    0.056527     0.14752       0.88272
    x3               0.21518    0.063416      3.3932    0.00069087
    x4             -0.058386    0.065503    -0.89135       0.37274
    x5             -0.060824    0.073441     -0.8282       0.40756
    x6               0.34267    0.056778      6.0352    1.5878e-09
    x7               0.04316     0.06146     0.70225       0.48252

100 observations, 92 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 119, p-value = 1.55e-22
The p-values of 2.14e-17, 0.00069, and 1.59e-09 indicate that the coefficients of the variables x1, x3, and x6 are statistically significant.
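As a follow-up sketch (not part of the original example), one way to act on this result is to refit using only the three columns whose coefficients were significant. The names Xred and mdlReduced are illustrative. Because the reduced matrix has three columns, fitglm labels them x1, x2, and x3, which correspond to the original x1, x3, and x6.

```matlab
rng('default') % same seed as the example above
X = randn(100,7);
mu = exp(X(:,[1 3 6])*[.4;.2;.3] + 1);
y = poissrnd(mu);

% Keep only the columns whose coefficients were significant.
% In the reduced model these are relabeled x1, x2, x3.
Xred = X(:,[1 3 6]);
mdlReduced = fitglm(Xred,y,'linear','Distribution','poisson');

% Deviance of the reduced fit, for comparison with the full model.
disp(mdlReduced.Deviance)
```

Comparing the Deviance property of the two fits gives a rough sense of how little the four dropped predictors contribute.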
modelspec— Model specification
'linear' (default) | character vector or string scalar naming the model | t-by-(p + 1) terms matrix | character vector or string scalar formula in the form 'Y ~ terms'
Model specification, specified as one of these values.
A character vector or string scalar naming the model.
|'constant'|Model contains only a constant (intercept) term.|
|'linear'|Model contains an intercept and linear term for each predictor.|
|'interactions'|Model contains an intercept, linear term for each predictor, and all products of pairs of distinct predictors (no squared terms).|
|'purequadratic'|Model contains an intercept term and linear and squared terms for each predictor.|
|'quadratic'|Model contains an intercept term, linear and squared terms for each predictor, and all products of pairs of distinct predictors.|
|'polyijk'|Model is a polynomial with all terms up to degree i in the first predictor, degree j in the second predictor, and so on.|
A t-by-(p + 1) matrix, or a Terms Matrix, specifying terms in the model, where t is the number of terms and p is the number of predictor variables, and +1 accounts for the response variable. A terms matrix is convenient when the number of predictors is large and you want to generate the terms programmatically.
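A minimal sketch of a terms matrix, using made-up data with three predictors (the variable names and coefficients are illustrative only). Each row of T is one term, each of the first three columns gives the power of the corresponding predictor in that term, and the last column is 0 for the response.

```matlab
rng('default')
X = randn(100,3);
y = poissrnd(exp(0.4*X(:,1) + 0.2*X(:,3) + 1));

% Rows: intercept, x1, x3, and the x1:x3 interaction.
% Columns: x1, x2, x3, and the response (always 0 here).
T = [0 0 0 0
     1 0 0 0
     0 0 1 0
     1 0 1 0];

mdl = fitglm(X,y,T,'Distribution','poisson');
disp(mdl.CoefficientNames)
```

Because T is an ordinary numeric array, it can be built in a loop or with array operations when the number of predictors is large.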
A character vector or string scalar representing a Formula in the form 'Y ~ terms', where the terms are in Wilkinson Notation. The variable names in the formula must be valid MATLAB identifiers.
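To illustrate how the operators in Wilkinson notation expand, the following sketch fits the hospital model twice: once with the compact formula from the first example, and once with every term written out explicitly. The names f1, f2, m1, and m2 are illustrative. Both formulas should describe the same set of terms.

```matlab
load hospital
dsa = hospital;

% Compact form: all terms up to the three-way interaction,
% minus the three-way interaction itself.
f1 = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex';

% Explicit form: main effects plus each two-way interaction.
f2 = 'Smoker ~ Age + Weight + Sex + Age:Weight + Age:Sex + Weight:Sex';

m1 = fitglm(dsa,f1,'Distribution','binomial');
m2 = fitglm(dsa,f2,'Distribution','binomial');

disp(m1.NumCoefficients)
disp(m2.NumCoefficients)
```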
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Distribution','normal','link','probit','Exclude',[23,59] specifies that the distribution of the response is normal, and instructs fitglm to use the probit link function and exclude the 23rd and 59th observations from the fit.
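A short sketch of name-value pairs in use, reusing the simulated Poisson data from the earlier example. The choice of observations 23 and 59 simply mirrors the example text above. The ObservationInfo table of the fitted model records which rows were excluded.

```matlab
rng('default')
X = randn(100,7);
y = poissrnd(exp(X(:,[1 3 6])*[.4;.2;.3] + 1));

% Name-value pairs follow the positional arguments, in any order.
mdl = fitglm(X,y,'linear', ...
    'Distribution','poisson', ...
    'Exclude',[23,59]);

disp(mdl.NumObservations)                  % rows actually used in the fit
disp(find(mdl.ObservationInfo.Excluded)')  % indices of the excluded rows
```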
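The generalized linear model mdl is a standard linear model unless you specify otherwise with the 'Distribution' name-value pair.
For methods such as plotResiduals or devianceTest, or properties of the GeneralizedLinearModel object, see GeneralizedLinearModel.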
After training a model, you can generate C/C++ code that predicts responses for new data. Generating C/C++ code requires MATLAB Coder™. For details, see Introduction to Code Generation.
fitglm treats a categorical predictor as follows:
A model with a categorical predictor that has L levels
(categories) includes L – 1 indicator variables. The model uses the first category as a
reference level, so it does not include the indicator variable for the reference
level. If the data type of the categorical predictor is
categorical, then you can check the order of categories by using
categories and reorder the
categories by using
reordercats to customize the reference level.
fitglm treats the group of L – 1 indicator variables as a single variable. If you want to treat
the indicator variables as distinct predictor variables, create indicator
variables manually by using
dummyvar. Then use the
indicator variables, except the one corresponding to the reference level of the
categorical variable, when you fit a model. For the categorical predictor
X, if you specify all columns of
dummyvar(X) and an intercept term as predictors, then the
design matrix becomes rank deficient.
Interaction terms between a continuous predictor and a categorical predictor with L levels consist of the element-wise product of the L – 1 indicator variables with the continuous predictor.
Interaction terms between two categorical predictors with L and M levels consist of the (L – 1)*(M – 1) indicator variables to include all possible combinations of the two categorical predictor levels.
You cannot specify higher-order terms for a categorical predictor because the square of an indicator is equal to itself.
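The following sketch illustrates these rules on made-up data with one continuous predictor x and one three-level categorical predictor g (all names and coefficients are illustrative). It fits the model with the default reference level, changes the reference level with reordercats, and then builds the indicator columns manually with dummyvar, dropping the first column to avoid a rank-deficient design matrix.

```matlab
rng('default')
n = 100;
x = randn(n,1);
g = categorical(randi(3,n,1),1:3,{'a','b','c'});
y = poissrnd(exp(0.5 + 0.3*x + 0.4*(g == 'c')));
tbl = table(x,g,y);

% Default: the first category ('a') is the reference level,
% so the model has L - 1 = 2 indicator terms for g.
mdl1 = fitglm(tbl,'y ~ x + g','Distribution','poisson');
disp(mdl1.CoefficientNames)

% Make 'c' the reference level instead.
tbl.g = reordercats(tbl.g,{'c','a','b'});
mdl2 = fitglm(tbl,'y ~ x + g','Distribution','poisson');
disp(mdl2.CoefficientNames)

% Manual indicators: drop the column for the reference level 'a'.
D = dummyvar(g);
mdl3 = fitglm([x D(:,2:end)],y,'linear','Distribution','poisson');
disp(mdl3.NumCoefficients)
```

In the third fit the two indicator columns are separate predictors, whereas the first two fits treat them as one grouped variable.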
fitglm considers NaN, '' (empty character vector), "" (empty string), <missing>, and <undefined> values in tbl, X, and Y to be missing values. fitglm does not use observations with missing values in the fit. The ObservationInfo property of a fitted model indicates whether or not fitglm uses each observation in the fit.
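As a small sketch of this behavior, the code below plants two NaN values in simulated data (the row indices 5 and 10 are arbitrary) and then inspects ObservationInfo to see which rows were dropped.

```matlab
rng('default')
X = randn(100,7);
y = poissrnd(exp(X(:,[1 3 6])*[.4;.2;.3] + 1));

% Introduce one missing predictor value and one missing response.
X(5,2) = NaN;
y(10) = NaN;

mdl = fitglm(X,y,'linear','Distribution','poisson');

disp(find(mdl.ObservationInfo.Missing)')  % rows flagged as missing
disp(mdl.NumObservations)                 % rows used in the fit
```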
This function supports tall arrays for out-of-memory data with some limitations.
If any input argument to fitglm is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the 'Weights', 'Exclude', 'Offset', and 'BinomialSize' name-value pairs.
The default number of iterations is 5. You can change the number of iterations using the 'Options' name-value pair to pass in an options structure. Create an options structure using statset to specify a different value for MaxIter.
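A minimal sketch of raising the iteration limit for a tall-array fit. The data are small in-memory arrays converted with tall purely for illustration, and the value 20 is an arbitrary choice.

```matlab
rng('default')
X = randn(100,7);
y = poissrnd(exp(X(:,[1 3 6])*[.4;.2;.3] + 1));

% Convert to tall arrays; all data inputs must then be tall.
tX = tall(X);
ty = tall(y);

% Options structure with a larger iteration limit (20 is arbitrary).
opts = statset('MaxIter',20);

mdl = fitglm(tX,ty,'linear','Distribution','poisson','Options',opts);
disp(class(mdl))
```

Each iteration requires a pass through the data, so a larger MaxIter trades run time for a better chance of convergence.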
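For tall data, fitglm returns a CompactGeneralizedLinearModel object that contains most of the same properties as a GeneralizedLinearModel object. The main difference is that the compact object is sensitive to memory requirements. The compact object does not include properties that include the data, or that include an array of the same size as the data. The compact object does not contain these GeneralizedLinearModel properties: Diagnostics, Fitted, Offset, ObservationInfo, ObservationNames, Residuals, Steps, and Variables.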
You can compute the residuals directly from the compact object returned by GLM = fitglm(X,Y) using
RES = Y - predict(GLM,X);
S = sqrt(GLM.SSE/GLM.DFE);
histogram(RES,linspace(-3*S,3*S,51))
For more information, see Tall Arrays for Out-of-Memory Data (MATLAB).