Generalized linear regression model class
mdl = stepwiseglm(tbl) or mdl = stepwiseglm(X,y) creates a generalized linear model of a table or dataset array tbl, or of the responses y to a data matrix X, with unimportant predictors excluded. For details, see stepwiseglm.
Input data, specified as a table or dataset array. When modelspec is a formula, it specifies the variables to be used as the predictors and response. Otherwise, if you do not specify the predictor and response variables, the last variable is the response variable and the others are the predictor variables by default.
Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.
To set a different column as the response variable, use the ResponseVar name-value pair argument. To use a subset of the columns as predictors, use the PredictorVars name-value pair argument.
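As a minimal sketch of these two name-value pairs, the following builds a small synthetic table and names the response and candidate predictors explicitly. The variable names Dose, Age, Noise, and Count are made up for illustration, and the data are simulated, not taken from this page.

```matlab
rng(0)                                  % reproducible synthetic data
n = 200;
Dose  = rand(n,1);
Age   = randi([20 60],n,1);
Noise = randn(n,1);                     % unrelated column, left out of the predictors
Count = poissrnd(exp(0.5 + 1.2*Dose));  % response, deliberately not the last column
tbl = table(Dose,Count,Age,Noise);

% Name the response explicitly and restrict the candidate predictors.
mdl = stepwiseglm(tbl,'constant','upper','linear', ...
    'ResponseVar','Count', ...
    'PredictorVars',{'Dose','Age'}, ...
    'Distribution','poisson','Verbose',0);
disp(mdl.ResponseName)
```

Without ResponseVar, the last column (Noise) would be treated as the response by default.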
Data Types: single | double | logical
Predictor variables, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables. Each column of X represents one variable, and each row represents one observation.
By default, there is a constant term in the model, unless you explicitly remove it, so do not include a column of 1s in X.
Data Types: single | double | logical
Response variable, specified as an n-by-1 vector, where n is the number of observations. Each entry in y is the response for the corresponding row of X.
Data Types: single | double
Covariance matrix of coefficient estimates.
Cell array of strings containing a label for each coefficient.
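Coefficient values stored as a table. Coefficients has one row for each coefficient and these columns: Estimate (estimated coefficient value), SE (standard error of the estimate), tStat (t-statistic for a test that the coefficient is zero), and pValue (p-value for the t-statistic).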
To obtain any of these columns as a vector, index into the property using dot notation. For example, in mdl the estimated coefficient vector is
beta = mdl.Coefficients.Estimate
Use coefTest to perform other tests on the coefficients.
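The sketch below shows one way to pull columns out of Coefficients and to run a joint test with coefTest. The simulated data and the contrast matrix H are assumptions chosen for illustration.

```matlab
rng(1)
X = randn(150,3);
y = poissrnd(exp(0.3 + 0.5*X(:,1) - 0.4*X(:,2)));
mdl = fitglm(X,y,'linear','Distribution','poisson');

beta = mdl.Coefficients.Estimate;   % estimated coefficients, intercept first
se   = mdl.Coefficients.SE;         % standard errors

% Joint test that the coefficients of x1 and x2 are both zero.
% Each row of H picks out one coefficient: [Intercept x1 x2 x3].
H = [0 1 0 0; 0 0 1 0];
p = coefTest(mdl,H);
```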
Deviance of the fit. It is useful for comparing two models when one is a special case of the other. The difference between the deviance of the two models has a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models. For more information on deviance, see Deviance.
Degrees of freedom for error (residuals), equal to the number of observations minus the number of estimated coefficients.
Table with diagnostics helpful in finding outliers and influential observations. The table contains the following fields:
All of these quantities are computed on the scale of the linear predictor. So, for example, in the equation that defines the hat matrix,
Yfit = glm.Fitted.LinearPredictor
Y = glm.Fitted.LinearPredictor + glm.Residuals.LinearPredictor
Scale factor of the variance of the response. Dispersion multiplies the variance function for the distribution.
For example, the variance function for the binomial distribution is p(1–p)/n, where p is the probability parameter and n is the sample size parameter. If Dispersion is near 1, the variance of the data appears to agree with the theoretical variance of the binomial distribution. If Dispersion is larger than 1, the data are "overdispersed" relative to the binomial distribution.
Logical value indicating whether fitglm used the Dispersion property to compute standard errors for the coefficients in Coefficients.SE. If DispersionEstimated is false, fitglm used the theoretical value of the variance.
Structure with the following fields relating to the generalized distribution:
Table of predicted (fitted) values based on the training data, a table with one row for each observation and the following columns.
To obtain any of the columns as a vector, index into the property using dot notation. For example, in the model mdl, the vector f of fitted values on the response scale is
f = mdl.Fitted.Response
Use predict to compute predictions for other predictor values, or to compute confidence bounds on Fitted.
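A short sketch of both access paths, using simulated data; Xnew holds hypothetical predictor values that are not part of the training set.

```matlab
rng(2)
X = randn(120,2);
y = poissrnd(exp(1 + 0.6*X(:,1)));
mdl = fitglm(X,y,'linear','Distribution','poisson');

f = mdl.Fitted.Response;            % fitted means for the training rows

Xnew = [0 0; 1 -1; 2 0.5];          % hypothetical new predictor values
[ynew,ci] = predict(mdl,Xnew);      % predictions with 95% confidence bounds by default
```

Calling predict(mdl,X) on the training matrix should reproduce f up to rounding.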
Object containing information about the model.
Structure with fields relating to the link function. The link is a function f that links the distribution parameter μ to the fitted linear combination Xb of the predictors:
f(μ) = Xb.
The structure has the following fields.
Log likelihood of the model distribution at the response values, with mean fitted from the model, and other parameters estimated as part of the model fit.
AIC and other information criteria for comparing models. A structure with fields:
To obtain any of these values as a scalar, index into the property using dot notation. For example, in a model mdl, the AIC value aic is:
aic = mdl.ModelCriterion.AIC
Number of coefficients in the model, a positive integer. NumCoefficients includes coefficients that are set to zero when the model terms are rank deficient.
Number of estimated coefficients in the model, a positive integer. NumEstimatedCoefficients does not include coefficients that are set to zero when the model terms are rank deficient. NumEstimatedCoefficients is the degrees of freedom for regression.
Number of observations the fitting function used in fitting. This is the number of observations supplied in the original table, dataset, or matrix, minus any excluded rows (set with the Excluded name-value pair) or rows with missing values.
Number of variables the fitting function used as predictors for fitting.
Number of variables in the data. NumVariables is the number of variables in the original table or dataset, or the total number of columns in the predictor matrix and response vector when the fit is based on those arrays. It includes variables, if any, that are not used as predictors or as the response.
Table with the same number of rows as the input data (tbl or X).
Cell array of strings containing the names of the observations used in the fit.
Vector with the same length as the number of rows in the data, passed from fitglm or stepwiseglm in the Offset name-value pair. The fitting function used Offset as a predictor variable, but with the coefficient set to exactly 1. In other words, the formula for fitting was
f(μ) ~ Offset + (terms involving real predictors)
where f is the link function and the Offset predictor has coefficient 1.
For example, consider a Poisson regression model. Suppose the number of counts is known for theoretical reasons to be proportional to a predictor A. By using the log link function and by specifying log(A) as an offset, you can force the model to satisfy this theoretical constraint.
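The Poisson case described above can be sketched as follows. The exposure variable A, the predictor x, and the true coefficients are all simulated assumptions.

```matlab
rng(3)
n = 200;
A = 1 + 9*rand(n,1);                   % exposure; counts assumed proportional to A
x = randn(n,1);
y = poissrnd(A .* exp(0.2 + 0.5*x));   % true rate per unit of A depends on x

% log(A) enters the linear predictor with coefficient fixed at 1.
mdl = fitglm(x,y,'linear','Distribution','poisson','Offset',log(A));
disp(mdl.Coefficients)
```

Only the intercept and the coefficient of x are estimated; the offset itself does not appear among the coefficients.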
Cell array of strings, the names of the predictors used in fitting the model.
Table containing residuals, with one row for each observation and these variables.
To obtain any of these columns as a vector, index into the property using dot notation. For example, in a model mdl, the ordinary raw residual vector r is:
r = mdl.Residuals.Raw
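As a quick check on simulated data, the raw residuals should equal the observed response minus the fitted mean on the response scale. This is a sketch under that assumption, not a statement about the other residual types.

```matlab
rng(4)
X = randn(100,2);
y = poissrnd(exp(0.5 + 0.4*X(:,2)));
mdl = fitglm(X,y,'linear','Distribution','poisson');

r = mdl.Residuals.Raw;                          % observed minus fitted, response scale
gap = max(abs(r - (y - mdl.Fitted.Response)));  % expected to be near zero
```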
Rows not used in the fit because of missing values (in ObservationInfo.Missing) contain NaN values.
Rows not used in the fit because of excluded values (in ObservationInfo.Excluded) contain NaN values, with the following exceptions:
String naming the response variable.
Proportion of total sum of squares explained by the model. The ordinary R-squared value relates to the SSR and SST properties:
Rsquared = SSR/SST = 1 - SSE/SST.
For a linear or nonlinear model, Rsquared is a structure with two fields:
For a generalized linear model, Rsquared is a structure with five fields:
To obtain any of these values as a scalar, index into the property using dot notation. For example, the adjusted R-squared value in mdl is
r2 = mdl.Rsquared.Adjusted
Sum of squared errors (residuals).
The Pythagorean theorem implies
SST = SSE + SSR.
Regression sum of squares, the sum of squared deviations of the fitted values from their mean.
The Pythagorean theorem implies
SST = SSE + SSR.
Total sum of squares, the sum of squared deviations of y from mean(y).
The Pythagorean theorem implies
SST = SSE + SSR.
Structure that is empty unless stepwiseglm constructed the model.
The History table has one row for each step including the initial fit, and the following variables (columns).
Table containing metadata about Variables. There is one row for each term in the model, and the following columns.
Cell array of strings containing names of the variables in the fit.
Table containing the data, both observations and responses, that the fitting function used to construct the fit. If the fit is based on a table or dataset array, Variables contains all of the data from that table or dataset array. Otherwise, Variables is a table created from the input data matrix X and response vector y.
|addTerms||Add terms to generalized linear model|
|coefCI||Confidence intervals of coefficient estimates of generalized linear model|
|coefTest||Linear hypothesis test on generalized linear regression model coefficients|
|devianceTest||Analysis of deviance|
|disp||Display generalized linear regression model|
|feval||Evaluate generalized linear regression model prediction|
|fit||Create generalized linear regression model|
|plotDiagnostics||Plot diagnostics of generalized linear regression model|
|plotResiduals||Plot residuals of generalized linear regression model|
|plotSlice||Plot of slices through fitted generalized linear regression surface|
|predict||Predict response of generalized linear regression model|
|random||Simulate responses for generalized linear regression model|
|removeTerms||Remove terms from generalized linear model|
|step||Improve generalized linear regression model by adding or removing terms|
|stepwise||Create generalized linear regression model by stepwise regression|
The default link function for a generalized linear model is the canonical link function.
Canonical Link Functions for Generalized Linear Models
|Distribution||Link Function Name||Link Function||Mean (Inverse) Function|
|'normal'||'identity'||f(μ) = μ||μ = Xb|
|'binomial'||'logit'||f(μ) = log(μ/(1–μ))||μ = exp(Xb) / (1 + exp(Xb))|
|'poisson'||'log'||f(μ) = log(μ)||μ = exp(Xb)|
|'gamma'||-1||f(μ) = 1/μ||μ = 1/(Xb)|
|'inverse gaussian'||-2||f(μ) = 1/μ^2||μ = (Xb)^(–1/2)|
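To illustrate the 'poisson' row of the table, the sketch below fits a Poisson model with its default (log) link on simulated data and compares exp(Xb) with the fitted mean.

```matlab
rng(5)
X = randn(100,2);
y = poissrnd(exp(1 + 0.3*X(:,1)));
mdl = fitglm(X,y,'linear','Distribution','poisson');   % canonical link: log

eta = mdl.Fitted.LinearPredictor;   % Xb
mu  = mdl.Fitted.Response;          % fitted mean
gap = max(abs(exp(eta) - mu));      % expected to be near zero
```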
The hat matrix H is defined in terms of the data matrix X and a diagonal weight matrix W:
$$H = X (X^T W X)^{-1} X^T W^T.$$
W has diagonal elements $w_i$, which are defined in terms of the following quantities:
g is the link function mapping $y_i$ to $x_i b$.
g′ is the derivative of the link function g.
V is the variance function.
$\mu_i$ is the ith mean.
The diagonal elements $h_{ii}$ of H satisfy
$$0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p,$$
where n is the number of observations (rows of X), and p is the number of coefficients in the regression model.
The leverage of observation i is the value of the ith diagonal term, hii, of the hat matrix H. Because the sum of the leverage values is p (the number of coefficients in the regression model), an observation i can be considered to be an outlier if its leverage substantially exceeds p/n, where n is the number of observations.
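The following sketch reads the leverage values from Diagnostics on simulated data and checks that they sum to p. The cutoff 2*p/n used to flag observations is a common rule of thumb assumed here for illustration; this page only says "substantially exceeds p/n".

```matlab
rng(6)
X = randn(80,3);
y = poissrnd(exp(0.5 + 0.3*X(:,1)));
mdl = fitglm(X,y,'linear','Distribution','poisson');

h = mdl.Diagnostics.Leverage;           % diagonal of the hat matrix
p = mdl.NumEstimatedCoefficients;
n = mdl.NumObservations;

total = sum(h);                         % expected to be close to p
flagged = find(h > 2*p/n);              % assumed rule-of-thumb cutoff
```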
The Cook's distance $D_i$ of observation i is defined in terms of the following quantities:
$\hat{\varphi}$ is the dispersion parameter (estimated or theoretical).
$e_i$ is the linear predictor residual, $g(y_i) - x_i \hat{\beta}$, where
g is the link function.
$y_i$ is the observed response.
$x_i$ is the observation.
$\hat{\beta}$ is the estimated coefficient vector.
p is the number of coefficients in the regression model.
$h_{ii}$ is the ith diagonal element of the Hat Matrix H.
Deviance of a model M1 is twice the difference between the loglikelihood of the saturated model, MS, and the loglikelihood of M1. The saturated model is the model with the maximum number of parameters that can be estimated. For example, if there are n observations $y_i$, i = 1, 2, ..., n, with potentially different values for $X_i^T \beta$, then you can define a saturated model with n parameters. Let L(b,y) denote the maximum value of the likelihood function for a model. Then the deviance of model M1 is
$$D_1 = -2\,(\log L(b_1, y) - \log L(b_S, y)),$$
where $b_1$ are the estimated parameters for model M1 and $b_S$ are the estimated parameters for the saturated model. The deviance has a chi-square distribution with n – p degrees of freedom, where n is the number of parameters in the saturated model and p is the number of parameters in model M1.
If M1 and M2 are two different generalized linear models, then the fit of the models can be assessed by comparing the deviances D1 and D2 of these models. The difference of the deviances is
$$D = D_2 - D_1 = -2\,(\log L(b_2, y) - \log L(b_S, y)) + 2\,(\log L(b_1, y) - \log L(b_S, y)) = -2\,(\log L(b_2, y) - \log L(b_1, y)),$$
where $b_2$ are the estimated parameters for model M2.
Asymptotically, this difference has a chi-square distribution with degrees of freedom v equal to the number of parameters that are estimated in one model but fixed (typically at 0) in the other. That is, it is equal to the difference in the number of parameters estimated in M1 and M2. You can get the p-value for this test using 1 - chi2cdf(D,v), where D = D2 – D1.
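The comparison can be sketched on simulated data as follows. Here M2 is taken to be the smaller model nested in M1, which is an assumption about the labeling that makes D = D2 – D1 nonnegative; the formulas rely on the default variable names x1, x2, x3, and y that are assigned to matrix inputs.

```matlab
rng(7)
X = randn(150,3);
y = poissrnd(exp(0.4 + 0.5*X(:,1)));

mdl1 = fitglm(X,y,'y ~ x1 + x2 + x3','Distribution','poisson');  % larger model M1
mdl2 = fitglm(X,y,'y ~ x1','Distribution','poisson');            % nested model M2

D = mdl2.Deviance - mdl1.Deviance;                                % D = D2 - D1
v = mdl1.NumEstimatedCoefficients - mdl2.NumEstimatedCoefficients;
pval = 1 - chi2cdf(D,v);
```

A small pval would suggest that the extra terms in M1 improve the fit beyond what chance alone would explain.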
Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® documentation.
Fit a logistic regression model of probability of smoking as a function of age, weight, and sex, using a two-way interactions model.
Load the hospital dataset array.
load hospital
ds = hospital; % just to use the ds name
Specify the model using a formula that allows up to two-way interactions.
modelspec = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex';
Create the generalized linear model.
mdl = fitglm(ds,modelspec,'Distribution','binomial')
mdl =
Generalized Linear regression model:
    logit(Smoker) ~ 1 + Sex*Age + Sex*Weight + Age*Weight
    Distribution = Binomial

Estimated Coefficients:
                        Estimate         SE           tStat       pValue
    (Intercept)           -6.0492        19.749      -0.3063     0.75938
    Sex_Male              -2.2859        12.424     -0.18399     0.85402
    Age                   0.11691       0.50977      0.22934     0.81861
    Weight               0.031109       0.15208      0.20455     0.83792
    Sex_Male:Age         0.020734       0.20681      0.10025     0.92014
    Sex_Male:Weight       0.01216      0.053168      0.22871      0.8191
    Age:Weight        -0.00071959     0.0038964     -0.18468     0.85348

100 observations, 93 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 5.07, p-value = 0.535
The large p-value indicates the model might not differ statistically from a constant.
Create response data using just three of 20 predictors, and create a generalized linear model stepwise to see if it uses just the correct predictors.
Create data with 20 predictors, and Poisson response using just three of the predictors, plus a constant.
rng('default') % for reproducibility
X = randn(100,20);
mu = exp(X(:,[5 10 15])*[.4;.2;.3] + 1);
y = poissrnd(mu);
Fit a generalized linear model using the Poisson distribution.
mdl = stepwiseglm(X,y,...
    'constant','upper','linear','Distribution','poisson')
1. Adding x5, Deviance = 134.439, Chi2Stat = 52.24814, PValue = 4.891229e-13
2. Adding x15, Deviance = 106.285, Chi2Stat = 28.15393, PValue = 1.1204e-07
3. Adding x10, Deviance = 95.0207, Chi2Stat = 11.2644, PValue = 0.000790094

mdl =
Generalized Linear regression model:
    log(y) ~ 1 + x5 + x10 + x15
    Distribution = Poisson

Estimated Coefficients:
                   Estimate       SE         tStat       pValue
    (Intercept)     1.0115      0.064275    15.737     8.4217e-56
    x5             0.39508      0.066665    5.9263     3.0977e-09
    x10            0.18863       0.05534    3.4085      0.0006532
    x15            0.29295      0.053269    5.4995     3.8089e-08

100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 91.7, p-value = 9.61e-20