# GeneralizedLinearModel class

Generalized linear regression model class

## Description

An object comprising training data, model description, diagnostic information, and fitted coefficients for a generalized linear regression. Predict model responses with the `predict` or `feval` methods.

## Construction

`mdl = fitglm(tbl)` or `mdl = fitglm(X,y)` creates a generalized linear model of a table or dataset array `tbl`, or of the responses `y` to a data matrix `X`. For details, see `fitglm`.

`mdl = stepwiseglm(tbl)` or `mdl = stepwiseglm(X,y)` creates a generalized linear model of a table or dataset array `tbl`, or of the responses `y` to a data matrix `X`, with unimportant predictors excluded. For details, see `stepwiseglm`.


### `tbl` — Input data

table | dataset array

Input data, specified as a table or dataset array. When `modelspec` is a `formula`, it specifies the variables to be used as the predictors and response. Otherwise, if you do not specify the predictor and response variables, the last variable is the response variable and the others are the predictor variables by default.

Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.

To set a different column as the response variable, use the `ResponseVar` name-value pair argument. To use a subset of the columns as predictors, use the `PredictorVars` name-value pair argument.

Data Types: `single` | `double` | `logical`

### `X` — Predictor variables

matrix

Predictor variables, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables. Each column of `X` represents one variable, and each row represents one observation.

By default, there is a constant term in the model, unless you explicitly remove it, so do not include a column of 1s in `X`.

Data Types: `single` | `double` | `logical`

### `y` — Response variable

vector

Response variable, specified as an n-by-1 vector, where n is the number of observations. Each entry in `y` is the response for the corresponding row of `X`.

Data Types: `single` | `double`

## Properties

`CoefficientCovariance`

Covariance matrix of coefficient estimates.

`CoefficientNames`

Cell array of strings containing a label for each coefficient.

`Coefficients`

Coefficient values stored as a table. `Coefficients` has one row for each coefficient and these columns:

• `Estimate` — Estimated coefficient value

• `SE` — Standard error of the estimate

• `tStat` — t statistic for a test that the coefficient is zero

• `pValue` — p-value for the t statistic

To obtain any of these columns as a vector, index into the property using dot notation. For example, in `mdl` the estimated coefficient vector is

`beta = mdl.Coefficients.Estimate`

Use `coefTest` to perform other tests on the coefficients.
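As a rough illustration of the column indexing described above, the following sketch fits a small Poisson model and recomputes the t statistics from the `Estimate` and `SE` columns. The simulated data, the seed, and the variable names are assumptions made for this example only; `coefCI` is called with its default confidence level.

```matlab
% Simulated data; the true coefficients and the seed are arbitrary choices.
rng('default')
X = randn(100,3);
y = poissrnd(exp(X*[0.4; 0.2; 0.3] + 1));

mdl = fitglm(X,y,'linear','Distribution','poisson');

beta = mdl.Coefficients.Estimate;   % estimates, intercept first
se   = mdl.Coefficients.SE;         % standard errors of the estimates
t    = beta ./ se;                  % recomputed t statistics

% Largest difference from the stored tStat column (rounding error only)
tGap = max(abs(t - mdl.Coefficients.tStat));

ci = coefCI(mdl);                   % one row per coefficient: [lower upper]
disp([beta se t ci])
```

Indexing by column keeps the code independent of the order in which the table happens to display its columns.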

`Deviance`

Deviance of the fit. It is useful for comparing two models when one is a special case of the other. The difference between the deviance of the two models has a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models. For more information on deviance, see Deviance.

`DFE`

Degrees of freedom for error (residuals), equal to the number of observations minus the number of estimated coefficients.

`Diagnostics`

Table with diagnostics helpful in finding outliers and influential observations. The table contains the following fields:

| Field | Meaning | Utility |
| --- | --- | --- |
| `Leverage` | Diagonal elements of `HatMatrix` | Leverage indicates to what extent the predicted value for an observation is determined by the observed value for that observation. A value close to `1` indicates that the prediction is largely determined by that observation, with little contribution from the other observations. A value close to `0` indicates the fit is largely determined by the other observations. For a model with p coefficients and n observations, the average value of `Leverage` is p/n. An observation with `Leverage` larger than 2*p/n can be an outlier. |
| `CooksDistance` | Cook's measure of scaled change in fitted values | `CooksDistance` is a measure of scaled change in fitted values. An observation with `CooksDistance` larger than three times the mean Cook's distance can be an outlier. |
| `HatMatrix` | Projection matrix to compute fitted from observed responses | `HatMatrix` is an n-by-n matrix such that `Fitted = HatMatrix*Y`, where `Y` is the response vector and `Fitted` is the vector of fitted response values. |

All of these quantities are computed on the scale of the linear predictor. So, for example, in the equation that defines the hat matrix, the fitted and observed responses are

```
Yfit = glm.Fitted.LinearPredictor
Y = glm.Fitted.LinearPredictor + glm.Residuals.LinearPredictor
```
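The rules of thumb for `Leverage` and `CooksDistance` can be applied directly to the `Diagnostics` table. The sketch below is one way to do this; the simulated data and the names `highLev` and `highCook` are assumptions for illustration, not part of the class.

```matlab
% Simulated Poisson data; coefficients and seed are arbitrary.
rng('default')
X = randn(100,3);
y = poissrnd(exp(X*[0.4; 0.2; 0.3] + 1));
mdl = fitglm(X,y,'linear','Distribution','poisson');

lev = mdl.Diagnostics.Leverage;
p   = mdl.NumEstimatedCoefficients;   % number of estimated coefficients
n   = mdl.NumObservations;            % number of observations used in the fit

highLev = find(lev > 2*p/n);          % leverage above twice the average p/n

cookD    = mdl.Diagnostics.CooksDistance;
highCook = find(cookD > 3*mean(cookD));   % above three times the mean

% The leverage values should sum to p, so their average is p/n.
fprintf('Sum of leverage: %.4f (p = %d)\n', sum(lev), p)
```

The same quantities can be inspected graphically with `plotDiagnostics(mdl,'leverage')` or `plotDiagnostics(mdl,'cookd')`.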

`Dispersion`

Scale factor of the variance of the response. `Dispersion` multiplies the variance function for the distribution.

For example, the variance function for the binomial distribution is p(1–p)/n, where p is the probability parameter and n is the sample size parameter. If `Dispersion` is near `1`, the variance of the data appears to agree with the theoretical variance of the binomial distribution. If `Dispersion` is larger than `1`, the data are "overdispersed" relative to the binomial distribution.

`DispersionEstimated`

Logical value indicating whether `fitglm` used the `Dispersion` property to compute standard errors for the coefficients in `Coefficients.SE`. If `DispersionEstimated` is `false`, `fitglm` used the theoretical value of the variance.

• `DispersionEstimated` can be `false` only for `'binomial'` or `'poisson'` distributions.

• Set `DispersionEstimated` by setting the `DispersionFlag` name-value pair in `fitglm`.
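A minimal sketch of the effect of `DispersionFlag`, assuming simulated Poisson data (the data and the model names `mdlTheory` and `mdlEst` are made up for this example). Because `Dispersion` multiplies the variance function, the standard errors of the second fit are expected to be those of the first scaled by roughly `sqrt(Dispersion)`.

```matlab
% Simulated Poisson data; coefficients and seed are arbitrary.
rng('default')
X = randn(100,3);
y = poissrnd(exp(X*[0.4; 0.2; 0.3] + 1));

% Theoretical variance (the default for the Poisson distribution)
mdlTheory = fitglm(X,y,'linear','Distribution','poisson');

% Estimated dispersion used for the standard errors
mdlEst = fitglm(X,y,'linear','Distribution','poisson','DispersionFlag',true);

disp(mdlTheory.DispersionEstimated)   % false: theoretical variance was used
disp(mdlEst.DispersionEstimated)      % true: Dispersion was estimated
disp(mdlEst.Dispersion)               % near 1 when the data are truly Poisson

% Ratio of standard errors, compared with sqrt(Dispersion)
seRatio = mdlEst.Coefficients.SE ./ mdlTheory.Coefficients.SE;
disp([seRatio, repmat(sqrt(mdlEst.Dispersion), numel(seRatio), 1)])
```

The coefficient estimates themselves are the same in both fits; only the standard errors, and therefore the t statistics and p-values, change.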

`Distribution`

Structure with the following fields relating to the generalized distribution:

| Field | Description |
| --- | --- |
| `Name` | Name of the distribution, one of `'normal'`, `'binomial'`, `'poisson'`, `'gamma'`, or `'inverse gaussian'`. |
| `DevianceFunction` | Function that computes the components of the deviance as a function of the fitted parameter values and the response values. |
| `VarianceFunction` | Function that computes the theoretical variance for the distribution as a function of the fitted parameter values. When `DispersionEstimated` is `true`, `Dispersion` multiplies the variance function in the computation of the coefficient standard errors. |

`Fitted`

Table of predicted (fitted) values based on the training data, with one row for each observation and the following columns.

| Field | Description |
| --- | --- |
| `Response` | Predicted values on the scale of the response. |
| `LinearPredictor` | Predicted values on the scale of the linear predictor. These are the same as the link function applied to the `Response` fitted values. |
| `Probability` | Fitted probabilities (this column is included only with the binomial distribution). |

To obtain any of the columns as a vector, index into the property using dot notation. For example, in the model `mdl`, the vector `f` of fitted values on the response scale is

`f = mdl.Fitted.Response`

Use `predict` to compute predictions for other predictor values, or to compute confidence bounds on `Fitted`.
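The relationship between the `Response` and `LinearPredictor` columns, and the use of `predict` for new predictor values, can be sketched as follows. The data and the new observations in `Xnew` are hypothetical; a log link is assumed because the model is Poisson with the default (canonical) link.

```matlab
% Simulated Poisson data; coefficients and seed are arbitrary.
rng('default')
X = randn(100,3);
y = poissrnd(exp(X*[0.4; 0.2; 0.3] + 1));
mdl = fitglm(X,y,'linear','Distribution','poisson');

f   = mdl.Fitted.Response;          % fitted means on the response scale
eta = mdl.Fitted.LinearPredictor;   % fitted values on the linear predictor scale

% With the log link, applying the link to the fitted means recovers eta.
linkGap = max(abs(log(f) - eta));

% Predictions and confidence bounds for two hypothetical new observations
Xnew = [0 0 0; 1 -1 0.5];
[ypred, yci] = predict(mdl, Xnew);
disp([ypred yci])
```

Each row of `yci` holds the lower and upper confidence bounds for the corresponding entry of `ypred`.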

`Formula`

Object containing information about the model.

`Link`

Structure with fields relating to the link function. The link is a function f that links the distribution parameter μ to the fitted linear combination Xb of the predictors:

f(μ) = Xb.

The structure has the following fields.

| Field | Description |
| --- | --- |
| `Name` | Name of the link function, or `''` if you specified the link as a function handle rather than a string. |
| `Link` | The function that defines f, a function handle. |
| `Derivative` | Derivative of f, a function handle. |
| `Inverse` | Inverse of f, a function handle. |

`LogLikelihood`

Log likelihood of the model distribution at the response values, with mean fitted from the model, and other parameters estimated as part of the model fit.

`ModelCriterion`

`AIC` and other information criteria for comparing models. A structure with fields:

• `AIC` — Akaike information criterion

• `AICc` — Akaike information criterion corrected for sample size

• `BIC` — Bayesian information criterion

• `CAIC` — Consistent Akaike information criterion

To obtain any of these values as a scalar, index into the property using dot notation. For example, in a model `mdl`, the AIC value `aic` is:

`aic = mdl.ModelCriterion.AIC`

`NumCoefficients`

Number of coefficients in the model, a positive integer. `NumCoefficients` includes coefficients that are set to zero when the model terms are rank deficient.

`NumEstimatedCoefficients`

Number of estimated coefficients in the model, a positive integer. `NumEstimatedCoefficients` does not include coefficients that are set to zero when the model terms are rank deficient. `NumEstimatedCoefficients` is the degrees of freedom for regression.

`NumObservations`

Number of observations the fitting function used in fitting. This is the number of observations supplied in the original table, dataset, or matrix, minus any excluded rows (set with the `Exclude` name-value pair) or rows with missing values.

`NumPredictors`

Number of variables the fitting function used as predictors for fitting.

`NumVariables`

Number of variables in the data. `NumVariables` is the number of variables in the original table or dataset, or the total number of columns in the predictor matrix and response vector when the fit is based on those arrays. It includes variables, if any, that are not used as predictors or as the response.

`ObservationInfo`

Table with the same number of rows as the input data (`tbl` or `X`).

| Field | Description |
| --- | --- |
| `Weights` | Observation weights. Default is all `1`. |
| `Excluded` | Logical value, `1` indicates an observation that you excluded from the fit with the `Exclude` name-value pair. |
| `Missing` | Logical value, `1` indicates a missing value in the input. Missing values are not used in the fit. |
| `Subset` | Logical value, `1` indicates the observation is not excluded or missing, so is used in the fit. |

`ObservationNames`

Cell array of strings containing the names of the observations used in the fit.

• If the fit is based on a table or dataset containing observation names, `ObservationNames` uses those names.

• Otherwise, `ObservationNames` is an empty cell array.

`Offset`

Vector with the same length as the number of rows in the data, passed from `fitglm` or `stepwiseglm` in the `Offset` name-value pair. The fitting function used `Offset` as a predictor variable, but with the coefficient set to exactly `1`. In other words, the formula for fitting was

`μ ~ Offset + (terms involving real predictors)`

with the `Offset` predictor having coefficient `1`.

For example, consider a Poisson regression model. Suppose the number of counts is known for theoretical reasons to be proportional to a predictor `A`. By using the log link function and by specifying `log(A)` as an offset, you can force the model to satisfy this theoretical constraint.
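The Poisson example above might look like the following sketch. The exposure variable `A`, the true coefficients 0.5 and 0.3, and the sample size are all assumptions chosen for illustration.

```matlab
% Counts proportional to a known exposure A (hypothetical simulated data)
rng('default')
n = 200;
A = 1 + 9*rand(n,1);                    % exposure, known for each observation
x = randn(n,1);                         % an ordinary predictor
y = poissrnd(A .* exp(0.5 + 0.3*x));    % mean is proportional to A

% log(A) enters the linear predictor with its coefficient fixed at 1,
% so the fitted mean is A .* exp(b0 + b1*x).
mdl = fitglm(x,y,'linear','Distribution','poisson','Offset',log(A));

disp(mdl.Coefficients.Estimate')        % should be near [0.5 0.3]
```

Had `log(A)` been supplied as a regular predictor instead, its coefficient would be estimated rather than fixed at 1, and the proportionality would hold only approximately.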

`PredictorNames`

Cell array of strings, the names of the predictors used in fitting the model.

`Residuals`

Table containing residuals, with one row for each observation and these variables.

| Field | Description |
| --- | --- |
| `Raw` | Observed minus fitted values. |
| `LinearPredictor` | Residuals on the linear predictor scale, equal to the adjusted response value minus the fitted linear combination of the predictors. |
| `Pearson` | Raw residuals divided by the estimated standard deviation of the response. |
| `Anscombe` | Residuals defined on transformed data with the transformation chosen to remove skewness. |
| `Deviance` | Residuals based on the contribution of each observation to the deviance. |

To obtain any of these columns as a vector, index into the property using dot notation. For example, in a model `mdl`, the ordinary raw residual vector `r` is:

`r = mdl.Residuals.Raw`

Rows not used in the fit because of missing values (in `ObservationInfo.Missing`) contain `NaN` values.

Rows not used in the fit because of excluded values (in `ObservationInfo.Excluded`) contain `NaN` values, with the following exceptions:

• `raw` contains the difference between the observed and predicted values.

• `standardized` is the residual, standardized in the usual way.

• `studentized` matches the standardized values because this residual is not used in the estimate of the residual standard deviation.
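Two of the residual types in the `Residuals` table can be checked against other properties, as in this sketch. It assumes simulated Poisson data with unit observation weights and no excluded or missing rows; the variable names are illustrative.

```matlab
% Simulated Poisson data; coefficients and seed are arbitrary.
rng('default')
X = randn(100,3);
y = poissrnd(exp(X*[0.4; 0.2; 0.3] + 1));
mdl = fitglm(X,y,'linear','Distribution','poisson');

% Raw residuals are observed minus fitted values on the response scale.
rawGap = max(abs(mdl.Residuals.Raw - (y - mdl.Fitted.Response)));

% Each deviance residual carries one observation's contribution to the
% deviance, so the squared residuals should add up to mdl.Deviance.
devGap = abs(sum(mdl.Residuals.Deviance.^2) - mdl.Deviance);

fprintf('raw gap = %.2e, deviance gap = %.2e\n', rawGap, devGap)
```

To plot a particular residual type, `plotResiduals` accepts a `'ResidualType'` name-value pair, for example `plotResiduals(mdl,'probability','ResidualType','Deviance')`.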

`ResponseName`

String naming the response variable.

`Rsquared`

Proportion of total sum of squares explained by the model. The ordinary R-squared value relates to the `SSR` and `SST` properties:

`Rsquared = SSR/SST = 1 - SSE/SST`.

For a linear or nonlinear model, `Rsquared` is a structure with two fields:

• `Ordinary` — Ordinary (unadjusted) R-squared

• `Adjusted` — R-squared adjusted for the number of coefficients

For a generalized linear model, `Rsquared` is a structure with five fields:

• `Ordinary` — Ordinary (unadjusted) R-squared

• `Adjusted` — R-squared adjusted for the number of coefficients

• `LLR` — Log-likelihood ratio

• `Deviance` — Deviance

• `AdjGeneralized` — Adjusted generalized R-squared

To obtain any of these values as a scalar, index into the property using dot notation. For example, the adjusted R-squared value in `mdl` is

`r2 = mdl.Rsquared.Adjusted`

`SSE`

Sum of squared errors (residuals).

The Pythagorean theorem implies

`SST = SSE + SSR`.

`SSR`

Regression sum of squares, the sum of squared deviations of the fitted values from their mean.

The Pythagorean theorem implies

`SST = SSE + SSR`.

`SST`

Total sum of squares, the sum of squared deviations of `y` from `mean(y)`.

The Pythagorean theorem implies

`SST = SSE + SSR`.

`Steps`

Structure that is empty unless `stepwiseglm` constructed the model.

| Field | Description |
| --- | --- |
| `Start` | Formula representing the starting model |
| `Lower` | Formula representing the lower bound model, the terms that must remain in the model |
| `Upper` | Formula representing the upper bound model, model cannot contain more terms than `Upper` |
| `Criterion` | Criterion used for the stepwise algorithm, such as `'sse'` |
| `PEnter` | Value of the parameter, such as `0.05` |
| `PRemove` | Value of the parameter, such as `0.10` |
| `History` | Table representing the steps taken in the fit |

The `History` table has one row for each step including the initial fit, and the following variables (columns).

| Field | Description |
| --- | --- |
| `Action` | Action taken during this step, one of: `'Start'` (first step), `'Add'` (a term is added), `'Remove'` (a term is removed) |
| `TermName` | For the `'Start'` step, the starting model specification; for `'Add'` or `'Remove'` steps, the term moved in that step |
| `Terms` | Terms matrix (see `modelspec` of `fitglm`) |
| `DF` | Regression degrees of freedom after this step |
| `delDF` | Change in regression degrees of freedom from previous step (negative for steps that remove a term) |
| `Deviance` | Deviance (residual sum of squares) at that step |
| `FStat` | F statistic that led to this step |
| `PValue` | p-value of the F statistic |

`VariableInfo`

Table containing metadata about `Variables`. There is one row for each variable, and the following columns.

| Field | Description |
| --- | --- |
| `Class` | String giving variable class, such as `'double'` |
| `Range` | Cell array giving variable range. For a continuous variable, a two-element vector `[min,max]`, the minimum and maximum values; for a categorical variable, a cell array of distinct variable values |
| `InModel` | Logical vector, where `true` indicates the variable is in the model |
| `IsCategorical` | Logical vector, where `true` indicates a categorical variable |

`VariableNames`

Cell array of strings containing names of the variables in the fit.

• If the fit is based on a table or dataset, this property provides the names of the variables in that table or dataset.

• If the fit is based on a predictor matrix and response vector, `VariableNames` contains the values in the `VarNames` name-value pair of the fitting method.

• Otherwise the variables have the default fitting names.

`Variables`

Table containing the data, both observations and responses, that the fitting function used to construct the fit. If the fit is based on a table or dataset array, `Variables` contains all of the data from that table or dataset array. Otherwise, `Variables` is a table created from the input data matrix `X` and response vector `y`.

## Methods

| Method | Description |
| --- | --- |
| `addTerms` | Add terms to generalized linear model |
| `coefCI` | Confidence intervals of coefficient estimates of generalized linear model |
| `coefTest` | Linear hypothesis test on generalized linear regression model coefficients |
| `devianceTest` | Analysis of deviance |
| `disp` | Display generalized linear regression model |
| `feval` | Evaluate generalized linear regression model prediction |
| `fit` | Create generalized linear regression model |
| `plotDiagnostics` | Plot diagnostics of generalized linear regression model |
| `plotResiduals` | Plot residuals of generalized linear regression model |
| `plotSlice` | Plot of slices through fitted generalized linear regression surface |
| `predict` | Predict response of generalized linear regression model |
| `random` | Simulate responses for generalized linear regression model |
| `removeTerms` | Remove terms from generalized linear model |
| `step` | Improve generalized linear regression model by adding or removing terms |
| `stepwise` | Create generalized linear regression model by stepwise regression |

## Definitions

The default link function for a generalized linear model is the canonical link function.

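Canonical Link Functions for Generalized Linear Models

| Distribution | Link Function Name | Link Function | Mean (Inverse) Function |
| --- | --- | --- | --- |
| `'normal'` | `'identity'` | f(μ) = μ | μ = Xb |
| `'binomial'` | `'logit'` | f(μ) = log(μ/(1–μ)) | μ = exp(Xb) / (1 + exp(Xb)) |
| `'poisson'` | `'log'` | f(μ) = log(μ) | μ = exp(Xb) |
| `'gamma'` | `-1` | f(μ) = 1/μ | μ = 1/(Xb) |
| `'inverse gaussian'` | `-2` | $f(\mu) = 1/{\mu}^{2}$ | $\mu = {(Xb)}^{-1/2}$ |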
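The link and mean functions listed for the canonical links can be written as anonymous functions and checked for consistency. This sketch does not use the fitted-model class at all; the handle names and the test values in `mu` are arbitrary choices for illustration.

```matlab
% Canonical links f and their inverses as anonymous functions
fIdentity = @(mu) mu;
gIdentity = @(eta) eta;
fLogit = @(mu) log(mu ./ (1 - mu));
gLogit = @(eta) exp(eta) ./ (1 + exp(eta));
fLog = @(mu) log(mu);
gLog = @(eta) exp(eta);
fRecip = @(mu) 1 ./ mu;            % link '-1'
gRecip = @(eta) 1 ./ eta;
fInvSq = @(mu) 1 ./ mu.^2;         % link '-2'
gInvSq = @(eta) eta.^(-1/2);

names = {'identity','logit','log','-1','-2'};
f     = {fIdentity, fLogit, fLog, fRecip, fInvSq};
finv  = {gIdentity, gLogit, gLog, gRecip, gInvSq};

mu  = [0.1 0.25 0.5 0.9];          % means inside (0,1), valid for every link
err = zeros(1,numel(f));
for k = 1:numel(f)
    % Applying the inverse to the link should return the original means.
    err(k) = max(abs(finv{k}(f{k}(mu)) - mu));
    fprintf('%-8s round-trip error: %.2e\n', names{k}, err(k));
end
```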

### Hat Matrix

The hat matrix H is defined in terms of the data matrix X and a diagonal weight matrix W:

$H=X{\left({X}^{T}WX\right)}^{-1}{X}^{T}{W}^{T}.$

W has diagonal elements $w_i$:

${w}_{i}=\frac{{g}^{\prime }\left({\mu }_{i}\right)}{\sqrt{V\left({\mu }_{i}\right)}},$

where

• g is the link function mapping $y_i$ to $x_i b$.

• ${g}^{\prime }$ is the derivative of the link function g.

• V is the variance function.

• $\mu_i$ is the ith mean.

The diagonal elements $h_{ii}$ of H satisfy

$\begin{array}{l}0\le {h}_{ii}\le 1\\ \sum _{i=1}^{n}{h}_{ii}=p,\end{array}$

where n is the number of observations (rows of X), and p is the number of coefficients in the regression model.

### Leverage

The leverage of observation i is the value of the ith diagonal term, $h_{ii}$, of the hat matrix H. Because the sum of the leverage values is p (the number of coefficients in the regression model), an observation i can be considered to be an outlier if its leverage substantially exceeds p/n, where n is the number of observations.

### Cook's Distance

The Cook's distance $D_i$ of observation i is

${D}_{i}={w}_{i}\frac{{e}_{i}^{2}}{p\hat{\phi }}\frac{{h}_{ii}}{{\left(1-{h}_{ii}\right)}^{2}},$

where

• $\hat{\phi }$ is the dispersion parameter (estimated or theoretical).

• $e_i$ is the linear predictor residual, $g\left({y}_{i}\right)-{x}_{i}\hat{\beta }$, where g is the link function, $y_i$ is the observed response, $x_i$ is the observation, and $\hat{\beta }$ is the estimated coefficient vector.

• p is the number of coefficients in the regression model.

• $h_{ii}$ is the ith diagonal element of the Hat Matrix H.

### Deviance

Deviance of a model $M_1$ is twice the difference between the loglikelihood of that model and the saturated model, $M_S$. The saturated model is the model with the maximum number of parameters that can be estimated. For example, if there are n observations $y_i$, i = 1, 2, ..., n, with potentially different values for $X_i^T\beta$, then you can define a saturated model with n parameters. Let L(b,y) denote the maximum value of the likelihood function for a model. Then the deviance of model $M_1$ is

$-2\left(\mathrm{log}L\left({b}_{1},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right),$

where $b_1$ are the estimated parameters for model $M_1$ and $b_S$ are the estimated parameters for the saturated model. The deviance has a chi-square distribution with n – p degrees of freedom, where n is the number of parameters in the saturated model and p is the number of parameters in model $M_1$.

If $M_1$ and $M_2$ are two different generalized linear models, then the fit of the models can be assessed by comparing the deviances $D_1$ and $D_2$ of these models. The difference of the deviances is

$\begin{array}{rl}D={D}_{2}-{D}_{1}& =-2\left(\mathrm{log}L\left({b}_{2},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right)+2\left(\mathrm{log}L\left({b}_{1},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right)\\ & =-2\left(\mathrm{log}L\left({b}_{2},y\right)-\mathrm{log}L\left({b}_{1},y\right)\right).\end{array}$

Asymptotically, this difference has a chi-square distribution with degrees of freedom v equal to the number of parameters that are estimated in one model but fixed (typically at 0) in the other. That is, it is equal to the difference in the number of parameters estimated in $M_1$ and $M_2$. You can get the p-value for this test using `1 - chi2cdf(D,v)`, where $D = D_2 - D_1$.
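The deviance comparison can be sketched with two nested Poisson models. The data below are simulated the same way as in the stepwise example on this page; taking `mdl1` as the larger model and `mdl2` as the constant-only model is a choice made here so that D = D2 – D1 is nonnegative.

```matlab
% Simulated data: only predictors 5, 10, and 15 affect the response.
rng('default')
X = randn(100,20);
mu = exp(X(:,[5 10 15])*[.4;.2;.3] + 1);
y = poissrnd(mu);

% M1: three predictors plus a constant. M2: constant only (nested in M1).
mdl1 = fitglm(X(:,[5 10 15]),y,'linear','Distribution','poisson');
mdl2 = fitglm(X,y,'constant','Distribution','poisson');

D = mdl2.Deviance - mdl1.Deviance;    % D = D2 - D1
v = mdl1.NumEstimatedCoefficients - mdl2.NumEstimatedCoefficients;
pval = 1 - chi2cdf(D,v);              % asymptotic chi-square p-value

fprintf('D = %.4g on %d degrees of freedom, p = %.3g\n', D, v, pval)
```

When the smaller model is the constant model, as here, this is the comparison that the model display summarizes as the "Chi^2-statistic vs. constant model"; `devianceTest(mdl1)` reports the same kind of test.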

## Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® documentation.

## Examples


### Fit a Generalized Linear Model

Fit a logistic regression model of probability of smoking as a function of age, weight, and sex, using a two-way interactions model.

Load the `hospital` dataset array.

```
load hospital
ds = hospital; % just to use the ds name
```

Specify the model using a formula that allows up to two-way interactions.

`modelspec = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex';`

Create the generalized linear model.

```
mdl = fitglm(ds,modelspec,'Distribution','binomial')
```

```
mdl = 

Generalized Linear regression model:
    logit(Smoker) ~ 1 + Sex*Age + Sex*Weight + Age*Weight
    Distribution = Binomial

Estimated Coefficients:
                        Estimate         SE          tStat      pValue 
    (Intercept)             -6.0492       19.749      -0.3063    0.75938
    Sex_Male                -2.2859       12.424     -0.18399    0.85402
    Age                     0.11691      0.50977      0.22934    0.81861
    Weight                 0.031109      0.15208      0.20455    0.83792
    Sex_Male:Age           0.020734      0.20681      0.10025    0.92014
    Sex_Male:Weight         0.01216     0.053168      0.22871     0.8191
    Age:Weight          -0.00071959    0.0038964     -0.18468    0.85348

100 observations, 93 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 5.07, p-value = 0.535
```

The large p-value indicates the model might not differ statistically from a constant.

### Create a Generalized Linear Model Stepwise

Create response data using just three of 20 predictors, and create a generalized linear model stepwise to see if it uses just the correct predictors.

Create data with 20 predictors, and Poisson response using just three of the predictors, plus a constant.

```
rng('default') % for reproducibility
X = randn(100,20);
mu = exp(X(:,[5 10 15])*[.4;.2;.3] + 1);
y = poissrnd(mu);
```

Fit a generalized linear model using the Poisson distribution.

```
mdl = stepwiseglm(X,y,...
    'constant','upper','linear','Distribution','poisson')
```

```
1. Adding x5, Deviance = 134.439, Chi2Stat = 52.24814, PValue = 4.891229e-13
2. Adding x15, Deviance = 106.285, Chi2Stat = 28.15393, PValue = 1.1204e-07
3. Adding x10, Deviance = 95.0207, Chi2Stat = 11.2644, PValue = 0.000790094

mdl = 

Generalized Linear regression model:
    log(y) ~ 1 + x5 + x10 + x15
    Distribution = Poisson

Estimated Coefficients:
                   Estimate       SE        tStat       pValue  
    (Intercept)     1.0115     0.064275    15.737    8.4217e-56
    x5             0.39508     0.066665    5.9263    3.0977e-09
    x10            0.18863      0.05534    3.4085     0.0006532
    x15            0.29295     0.053269    5.4995    3.8089e-08

100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 91.7, p-value = 9.61e-20
```