Create generalized linear regression model
Fit a Logistic Regression Model
Make a logistic binomial model of the probability of smoking as a function of age, weight, and sex, using a two-way interactions model.
Load the hospital dataset array.
load hospital
dsa = hospital;
Specify the model using a formula that allows up to two-way interactions between the variables age, weight, and sex. Smoker is the response variable.
modelspec = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex';
Fit a logistic binomial model.
mdl = fitglm(dsa,modelspec,'Distribution','binomial')
mdl = 
Generalized linear regression model:
    logit(Smoker) ~ 1 + Sex*Age + Sex*Weight + Age*Weight
    Distribution = Binomial

Estimated Coefficients:
                        Estimate         SE         tStat      pValue 
                       ___________    _________    ________    _______
    (Intercept)            -6.0492       19.749     -0.3063    0.75938
    Sex_Male               -2.2859       12.424    -0.18399    0.85402
    Age                    0.11691      0.50977     0.22934    0.81861
    Weight                0.031109      0.15208     0.20455    0.83792
    Sex_Male:Age          0.020734      0.20681     0.10025    0.92014
    Sex_Male:Weight        0.01216     0.053168     0.22871     0.8191
    Age:Weight         -0.00071959    0.0038964    -0.18468    0.85348

100 observations, 93 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 5.07, p-value = 0.535
All of the p-values (under pValue) are large, so none of the coefficients is statistically significant. The large p-value for the test of the model, 0.535, indicates that this model might not differ statistically from a constant model.
GLM for Poisson Response
Create sample data with 7 predictors, and a Poisson response that depends on just three of the predictors, plus a constant.
rng('default') % for reproducibility
X = randn(100,7);
mu = exp(X(:,[1 3 6])*[.4;.2;.3] + 1);
y = poissrnd(mu);
Fit a generalized linear model using the Poisson distribution.
mdl = fitglm(X,y,'linear','Distribution','poisson')
mdl = 
Generalized linear regression model:
    log(y) ~ 1 + x1 + x2 + x3 + x4 + x5 + x6 + x7
    Distribution = Poisson

Estimated Coefficients:
                   Estimate        SE         tStat        pValue  
                   _________    ________    ________    __________
    (Intercept)      0.88723    0.070969      12.502    7.3149e-36
    x1               0.44413    0.052337      8.4858    2.1416e-17
    x2             0.0083388    0.056527     0.14752       0.88272
    x3               0.21518    0.063416      3.3932    0.00069087
    x4             -0.058386    0.065503    -0.89135       0.37274
    x5             -0.060824    0.073441     -0.8282       0.40756
    x6               0.34267    0.056778      6.0352    1.5878e-09
    x7               0.04316     0.06146     0.70225       0.48252

100 observations, 92 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 119, p-value = 1.55e-22
The p-values of 2.14e-17, 0.00069, and 1.59e-09 indicate that the coefficients of the variables x1, x3, and x6 are statistically significant.
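As a follow-up to the example above, the model can be refit using only the three predictors with small p-values. The sketch below assumes that, for matrix input, the default variable names x1, ..., x7 and y can be used in a formula. The names mdlReduced and mdlFull are illustrative, and the AIC comparison is only one rough way to compare the two fits.

```matlab
% Regenerate the sample data from the Poisson example.
rng('default') % for reproducibility
X = randn(100,7);
mu = exp(X(:,[1 3 6])*[.4;.2;.3] + 1);
y = poissrnd(mu);

% Fit the full model and a reduced model that keeps only x1, x3, and x6.
mdlFull = fitglm(X,y,'linear','Distribution','poisson');
mdlReduced = fitglm(X,y,'y ~ x1 + x3 + x6','Distribution','poisson');

% List the terms of the reduced model and compare information criteria.
disp(mdlReduced.CoefficientNames)
fprintf('AIC full: %.1f, AIC reduced: %.1f\n', ...
    mdlFull.ModelCriterion.AIC, mdlReduced.ModelCriterion.AIC)
```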
modelspec — Model specification
'linear' (default) | character vector or string scalar naming the model | t-by-(p + 1) terms matrix | character vector or string scalar formula in the form 'y ~ terms'
Model specification, specified as one of these values.
A character vector or string scalar naming the model.
Value               Model Type
'constant'          Model contains only a constant (intercept) term.
'linear'            Model contains an intercept and linear term for each predictor.
'interactions'      Model contains an intercept, linear term for each predictor, and all products of pairs of distinct predictors (no squared terms).
'purequadratic'     Model contains an intercept term and linear and squared terms for each predictor.
'quadratic'         Model contains an intercept term, linear and squared terms for each predictor, and all products of pairs of distinct predictors.
'polyijk'           Model is a polynomial with all terms up to degree i in the first predictor, degree j in the second predictor, and so on. Specify the maximum degree for each predictor by using numerals 0 through 9. The model contains interaction terms, but the degree of each interaction term does not exceed the maximum value of the specified degrees. For example, 'poly13' has an intercept and x1, x2, x2^2, x2^3, x1*x2, and x1*x2^2 terms, where x1 and x2 are the first and second predictors, respectively.
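To see which terms a polynomial specification such as 'poly13' produces, one option is to fit it to two predictors and inspect the coefficient names. This is a minimal sketch: the response below is made up purely so that there is something to fit, and mdlPoly is an illustrative name.

```matlab
rng('default') % for reproducibility
X = randn(200,2);

% Hypothetical response, used only so that the fit has data to work with.
y = 1 + X(:,1) - 0.5*X(:,2).^2 + 0.1*randn(200,1);

% 'poly13': degree 1 in the first predictor, degree 3 in the second.
% No 'Distribution' is given, so the default (normal) is used.
mdlPoly = fitglm(X,y,'poly13');

% Show the terms in the fitted model.
disp(mdlPoly.CoefficientNames')
```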
A t-by-(p + 1) matrix, or a Terms Matrix, specifying terms in the model, where t is the number of terms and p is the number of predictor variables, and +1 accounts for the response variable. A terms matrix is convenient when the number of predictors is large and you want to generate the terms programmatically.
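A terms matrix can be built programmatically. The sketch below assumes three predictors and a made-up Poisson response. Each row of T is one term, each of the first p columns gives the power of the corresponding predictor in that term, and the last column (for the response) is zero. The names T and mdlT are illustrative.

```matlab
rng('default') % for reproducibility
X = randn(100,3);
y = poissrnd(exp(0.3*X(:,1) + 0.2*X(:,1).*X(:,2) + 1));

p = size(X,2);
interceptRow = zeros(1,p+1);        % all powers zero: constant term
linearRows = [eye(p) zeros(p,1)];   % one row per linear term
interactionRow = [1 1 0 0];         % x1:x2 (written for p = 3)
T = [interceptRow; linearRows; interactionRow];

mdlT = fitglm(X,y,T,'Distribution','poisson');
disp(mdlT.CoefficientNames)
```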
A character vector or string scalar Formula in the form 'y ~ terms', where the terms are in Wilkinson Notation. The variable names in the formula must be variable names in tbl or variable names specified by VarNames. Also, the variable names must be valid MATLAB identifiers.
The software determines the order of terms in a fitted model by using the order of terms in tbl or X. Therefore, the order of terms in the model can be different from the order of terms in the specified formula.
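The following sketch illustrates how a few common Wilkinson notation forms expand. It assumes a small table with made-up variables a, b, and y. In this notation a*b stands for a + b + a:b, and a minus sign removes a term.

```matlab
rng('default') % for reproducibility
tbl = table(randn(100,1),randn(100,1),'VariableNames',{'a','b'});
tbl.y = poissrnd(exp(0.5 + 0.3*tbl.a - 0.2*tbl.b));

% a*b expands to main effects plus the interaction.
m1 = fitglm(tbl,'y ~ a*b','Distribution','poisson');
m2 = fitglm(tbl,'y ~ a + b + a:b','Distribution','poisson');

% Subtracting a:b leaves only the main effects.
m3 = fitglm(tbl,'y ~ a*b - a:b','Distribution','poisson');

disp(m1.CoefficientNames)
disp(m3.CoefficientNames)
```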
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Distribution','normal','Link','probit','Exclude',[23,59] specifies that the distribution of the response is normal, and instructs fitglm to use the probit link function and exclude the 23rd and 59th observations from the fit.
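As an illustration of combining name-value arguments, the sketch below fits a probit model and excludes two observations. It swaps in a binomial distribution with a simulated binary response so that the probit link is a natural choice. The data and the name mdlProbit are assumptions made for this example.

```matlab
rng('default') % for reproducibility
X = randn(100,2);

% Simulated 0/1 response whose success probability follows a probit model.
y = double(rand(100,1) < normcdf(X*[0.8; -0.5]));

% Comma-separated, quoted names work in all releases.
mdlProbit = fitglm(X,y,'linear', ...
    'Distribution','binomial','Link','probit','Exclude',[23 59]);

% Excluded rows are not counted as observations used in the fit.
disp(mdlProbit.NumObservations)
disp(find(mdlProbit.ObservationInfo.Excluded)')
```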
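The generalized linear model mdl is a standard linear model unless you specify otherwise with the Distribution name-value pair.
For methods such as plotResiduals or devianceTest, or properties of the GeneralizedLinearModel object, see GeneralizedLinearModel.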
After training a model, you can generate C/C++ code that predicts responses for new data. Generating C/C++ code requires MATLAB Coder™. For details, see Introduction to Code Generation.
fitglm treats a categorical predictor as follows:
A model with a categorical predictor that has L levels (categories) includes L – 1 indicator variables. The model uses the first category as a reference level, so it does not include the indicator variable for the reference level. If the data type of the categorical predictor is categorical, then you can check the order of categories by using categories and reorder the categories by using reordercats to customize the reference level. For more details about creating indicator variables, see Automatic Creation of Dummy Variables.
fitglm treats the group of L – 1 indicator variables as a single variable. If you want to treat the indicator variables as distinct predictor variables, create indicator variables manually by using dummyvar. Then use the indicator variables, except the one corresponding to the reference level of the categorical variable, when you fit a model. For the categorical predictor X, if you specify all columns of dummyvar(X) and an intercept term as predictors, then the design matrix becomes rank deficient.
Interaction terms between a continuous predictor and a categorical predictor with L levels consist of the element-wise product of the L – 1 indicator variables with the continuous predictor.
Interaction terms between two categorical predictors with L and M levels consist of the (L – 1)*(M – 1) indicator variables to include all possible combinations of the two categorical predictor levels.
You cannot specify higher-order terms for a categorical predictor because the square of an indicator is equal to itself.
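The reference-level behavior described above can be sketched as follows. The example assumes a made-up table with a three-level categorical predictor g (levels A, B, C), a continuous predictor x, and a Poisson response. Reordering the categories with reordercats changes which level is absorbed into the intercept.

```matlab
rng('default') % for reproducibility
n = 150;
g = categorical(randi(3,n,1),1:3,{'A','B','C'});
x = randn(n,1);
y = poissrnd(exp(0.5 + 0.3*x + 0.4*(g == 'C')));
tbl = table(x,g,y);

% Default order: A is the first category, so it is the reference level.
m1 = fitglm(tbl,'y ~ x + g','Distribution','poisson');
disp(m1.CoefficientNames)

% Make C the first category, so that C becomes the reference level.
tbl.g = reordercats(tbl.g,{'C','A','B'});
m2 = fitglm(tbl,'y ~ x + g','Distribution','poisson');
disp(m2.CoefficientNames)
```

With L = 3 levels, each fit contains L – 1 = 2 indicator coefficients for g.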
fitglm considers NaN, '' (empty character vector), "" (empty string), <missing>, and <undefined> values in tbl, X, and Y to be missing values. fitglm does not use observations with missing values in the fit. The ObservationInfo property of a fitted model indicates whether or not fitglm uses each observation in the fit.
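A short sketch of the missing-value handling, assuming made-up data with one NaN placed in a predictor and one in the response. The row indices 5 and 40 are arbitrary choices for this example.

```matlab
rng('default') % for reproducibility
X = randn(100,2);
y = poissrnd(exp(0.4*X(:,1) + 1));

% Introduce one missing predictor value and one missing response value.
X(5,1) = NaN;
y(40) = NaN;

mdlMiss = fitglm(X,y,'linear','Distribution','poisson');

% Rows with missing values are dropped from the fit and flagged.
disp(mdlMiss.NumObservations)
disp(find(mdlMiss.ObservationInfo.Missing)')
```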
Calculate with arrays that have more rows than fit in memory.
This function supports tall arrays for out-of-memory data with some limitations.
If any input argument to fitglm is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the 'Weights' and 'Exclude' name-value pairs.
The default number of iterations is 5. You can change the number of iterations using the 'Options' name-value pair to pass in an options structure. Create an options structure using statset to specify a different value for MaxIter.
For tall data, fitglm returns a CompactGeneralizedLinearModel object that contains most of the same properties as a GeneralizedLinearModel object. The main difference is that the compact object is sensitive to memory requirements. The compact object does not include properties that include the data, or that include an array of the same size as the data.
You can compute the residuals directly from the compact object returned by GLM = fitglm(X,Y) using
RES = Y - predict(GLM,X);
S = sqrt(GLM.SSE/GLM.DFE);
histogram(RES,linspace(-3*S,3*S,51))
For more information, see Tall Arrays for Out-of-Memory Data.
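The iteration setting for tall arrays described above can be sketched as follows. This example wraps small in-memory arrays with tall purely for illustration. Real tall data would normally come from a datastore. The data, the value 10, and the names opts and mdlTall are assumptions made for this example.

```matlab
rng('default') % for reproducibility
X = randn(1000,3);
y = poissrnd(exp(X*[.4;.2;.3] + 1));

% Both the predictors and the response must be tall.
tX = tall(X);
tY = tall(y);

% Raise the iteration limit through an options structure.
opts = statset('MaxIter',10);
mdlTall = fitglm(tX,tY,'linear','Distribution','poisson','Options',opts);

% For tall inputs the result is a compact model object.
disp(class(mdlTall))
```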
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).