# GeneralizedLinearModel class

Generalized linear regression model class

## Description

An object comprising training data, model description, diagnostic information, and fitted coefficients for a generalized linear regression. Predict model responses with the `predict` or `feval` methods.

## Construction

`mdl = fitglm(tbl)` or `mdl = fitglm(X,y)` creates a generalized linear model of a table or dataset array `tbl`, or of the responses `y` to a data matrix `X`. For details, see `fitglm`.

`mdl = stepwiseglm(tbl)` or `mdl = stepwiseglm(X,y)` creates a generalized linear model of a table or dataset array `tbl`, or of the responses `y` to a data matrix `X`, with unimportant predictors excluded. For details, see `stepwiseglm`.


### `tbl` — Input data (table | dataset array)

Input data, specified as a table or dataset array. When `modelspec` is a `formula`, it specifies the variables to be used as the predictors and response. Otherwise, if you do not specify the predictor and response variables, the last variable is the response variable and the others are the predictor variables by default.

Predictor variables can be numeric, or any grouping variable type, such as logical or categorical (see Grouping Variables). The response must be numeric or logical.

To set a different column as the response variable, use the `ResponseVar` name-value pair argument. To use a subset of the columns as predictors, use the `PredictorVars` name-value pair argument.

Data Types: `single` | `double` | `logical`

### `X` — Predictor variables (matrix)

Predictor variables, specified as an n-by-p matrix, where n is the number of observations and p is the number of predictor variables. Each column of `X` represents one variable, and each row represents one observation.

By default, there is a constant term in the model, unless you explicitly remove it, so do not include a column of 1s in `X`.

Data Types: `single` | `double` | `logical`

### `y` — Response variable (vector)

Response variable, specified as an n-by-1 vector, where n is the number of observations. Each entry in `y` is the response for the corresponding row of `X`.

Data Types: `single` | `double`
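As a minimal sketch of the matrix syntax described above, the following fits a Poisson model to simulated data. The data, the coefficient values, and the variable names are illustrative assumptions, not part of the original page. Note that `X` contains no column of 1s, because the constant term is added by default.

```matlab
rng default                       % for reproducibility
X = randn(50,2);                  % 50 observations, 2 predictors (no column of 1s)
mu = exp(0.5 + X*[0.3; -0.2]);    % assumed Poisson mean under a log link
y = poissrnd(mu);                 % simulated count response, 50-by-1
mdl = fitglm(X,y,'linear','Distribution','poisson');
numCoef = mdl.NumCoefficients;    % intercept plus one slope per column of X
numObs = mdl.NumObservations;     % rows used in the fit
```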

## Properties


### `CoefficientCovariance` — Covariance matrix of coefficient estimates (numeric matrix)

Covariance matrix of coefficient estimates, stored as a p-by-p matrix of numeric values. p is the number of coefficients in the fitted model.

### `CoefficientNames` — Coefficient names (cell array of strings)

Coefficient names, stored as a cell array of strings containing a label for each coefficient.

### `Coefficients` — Coefficient values (table)

Coefficient values, stored as a table. `Coefficients` has one row for each coefficient and the following columns:

• `Estimate` — Estimated coefficient value

• `SE` — Standard error of the estimate

• `tStat` — t statistic for a test that the coefficient is zero

• `pValue` — p-value for the t statistic

To obtain any of these columns as a vector, index into the property using dot notation. For example, in `mdl` the estimated coefficient vector is

`beta = mdl.Coefficients.Estimate`

Use `coefTest` to perform other tests on the coefficients.
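The following sketch, on simulated data that is an assumption of this example, pulls columns out of `Coefficients` and checks that `tStat` is the ratio of `Estimate` to `SE`. It then uses `coefTest` with a hypothetical contrast row to test a single coefficient.

```matlab
rng default                                   % for reproducibility
X = randn(100,3);                             % simulated predictors (illustrative only)
y = poissrnd(exp(1 + X*[0.4; 0; 0.3]));       % x2 has no true effect
mdl = fitglm(X,y,'linear','Distribution','poisson');

beta = mdl.Coefficients.Estimate;             % estimates as a column vector
se   = mdl.Coefficients.SE;                   % standard errors
t    = beta ./ se;                            % recompute the t statistics by hand
maxDiff = max(abs(t - mdl.Coefficients.tStat));

% Test H0: coefficient of x2 is zero. Columns of the contrast follow
% the coefficient order (Intercept), x1, x2, x3.
p = coefTest(mdl,[0 0 1 0]);
```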

### `Deviance` — Deviance of the fit (numeric value)

Deviance of the fit, stored as a numeric value. Deviance is useful for comparing two models when one is a special case of the other. The difference between the deviance of the two models has a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models. For more information on deviance, see Deviance.

### `DFE` — Degrees of freedom for error (positive integer value)

Degrees of freedom for error (residuals), equal to the number of observations minus the number of estimated coefficients, stored as a positive integer value.

### `Diagnostics` — Diagnostic information (table)

Diagnostic information for the model, stored as a table. Diagnostics can help identify outliers and influential observations. `Diagnostics` contains the following fields:

| Field | Meaning | Utility |
| --- | --- | --- |
| `Leverage` | Diagonal elements of `HatMatrix` | Leverage indicates to what extent the predicted value for an observation is determined by the observed value for that observation. A value close to `1` indicates that the prediction is largely determined by that observation, with little contribution from the other observations. A value close to `0` indicates the fit is largely determined by the other observations. For a model with p coefficients and n observations, the average value of `Leverage` is p/n. An observation with `Leverage` larger than 2*p/n can be an outlier. |
| `CooksDistance` | Cook's measure of scaled change in fitted values | `CooksDistance` is a measure of scaled change in fitted values. An observation with `CooksDistance` larger than three times the mean Cook's distance can be an outlier. |
| `HatMatrix` | Projection matrix to compute fitted from observed responses | `HatMatrix` is an n-by-n matrix such that `Fitted = HatMatrix*Y`, where `Y` is the response vector and `Fitted` is the vector of fitted response values. |

All of these quantities are computed on the scale of the linear predictor. So, for example, in the equation that defines the hat matrix, the fitted and observed responses for a model `glm` are:

```
Yfit = glm.Fitted.LinearPredictor
Y = glm.Fitted.LinearPredictor + glm.Residuals.LinearPredictor
```
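The rules of thumb above (leverage above 2*p/n, Cook's distance above three times its mean) can be applied as in this sketch. The simulated data and the variable names are assumptions of the example, and the thresholds are screening heuristics rather than formal tests.

```matlab
rng default                                   % for reproducibility
X = randn(100,2);                             % simulated predictors (illustrative only)
y = poissrnd(exp(1 + X*[0.4; 0.3]));
mdl = fitglm(X,y,'linear','Distribution','poisson');

n = mdl.NumObservations;
p = mdl.NumCoefficients;
lev  = mdl.Diagnostics.Leverage;              % diagonal of the hat matrix
cook = mdl.Diagnostics.CooksDistance;

highLev  = find(lev > 2*p/n);                 % rows with unusually high leverage
highCook = find(cook > 3*mean(cook));         % rows with unusually large Cook's distance
levSum = sum(lev);                            % should be close to p for a full-rank fit
```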

### `Dispersion` — Scale factor of the variance of the response (structure)

Scale factor of the variance of the response, stored as a structure. `Dispersion` multiplies the variance function for the distribution.

For example, the variance function for the binomial distribution is p(1–p)/n, where p is the probability parameter and n is the sample size parameter. If `Dispersion` is near `1`, the variance of the data appears to agree with the theoretical variance of the binomial distribution. If `Dispersion` is larger than `1`, the data are "overdispersed" relative to the binomial distribution.

### `DispersionEstimated` — Flag to indicate use of dispersion scale factor (logical value)

Flag to indicate whether `fitglm` used the `Dispersion` scale factor to compute standard errors for the coefficients in `Coefficients.SE`, stored as a logical value. If `DispersionEstimated` is `false`, `fitglm` used the theoretical value of the variance.

• `DispersionEstimated` can be `false` only for `'binomial'` or `'poisson'` distributions.

• Set `DispersionEstimated` by setting the `DispersionFlag` name-value pair in `fitglm`.

### `Distribution` — Generalized distribution information (structure)

Generalized distribution information, stored as a structure with the following fields relating to the generalized distribution:

| Field | Description |
| --- | --- |
| `Name` | Name of the distribution, one of `'normal'`, `'binomial'`, `'poisson'`, `'gamma'`, or `'inverse gaussian'`. |
| `DevianceFunction` | Function that computes the components of the deviance as a function of the fitted parameter values and the response values. |
| `VarianceFunction` | Function that computes the theoretical variance for the distribution as a function of the fitted parameter values. When `DispersionEstimated` is `true`, `Dispersion` multiplies the variance function in the computation of the coefficient standard errors. |

### `Fitted` — Fitted response values based on input data (table)

Fitted (predicted) values based on the input data, stored as a table with one row for each observation and the following columns.

| Field | Description |
| --- | --- |
| `Response` | Predicted values on the scale of the response. |
| `LinearPredictor` | Predicted values on the scale of the linear predictor. These are the same as the link function applied to the `Response` fitted values. |
| `Probability` | Fitted probabilities (this column is included only with the binomial distribution). |

To obtain any of the columns as a vector, index into the property using dot notation. For example, in the model `mdl`, the vector `f` of fitted values on the response scale is

`f = mdl.Fitted.Response`

Use `predict` to compute predictions for other predictor values, or to compute confidence bounds on `Fitted`.
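A short sketch of the relationship between the two scales, assuming a Poisson model with its default log link fitted to simulated data: applying `log` to `Fitted.Response` should reproduce `Fitted.LinearPredictor`. The new predictor rows `Xnew` are hypothetical.

```matlab
rng default                                   % for reproducibility
X = randn(100,2);                             % simulated predictors (illustrative only)
y = poissrnd(exp(1 + X*[0.4; 0.3]));
mdl = fitglm(X,y,'linear','Distribution','poisson');

f   = mdl.Fitted.Response;                    % fitted means, response scale
eta = mdl.Fitted.LinearPredictor;             % fitted values, linear predictor scale
maxDiff = max(abs(log(f) - eta));             % log link: eta = log(mu)

Xnew = [0 0; 1 -1];                           % hypothetical new predictor rows
[ypred,yci] = predict(mdl,Xnew);              % predictions with confidence bounds
```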

### `Formula` — Model information (`LinearFormula` object | `NonLinearFormula` object)

Model information, stored as a `LinearFormula` object or `NonLinearFormula` object. If you fit a linear or generalized linear regression model, then `Formula` is a `LinearFormula` object. If you fit a nonlinear regression model, then `Formula` is a `NonLinearFormula` object.

### `Link` — Link function (structure)

Link function, stored as a structure with the following fields:

| Field | Description |
| --- | --- |
| `Name` | Name of the link function, or `''` if you specified the link as a function handle rather than a string. |
| `Link` | The function that defines f, a function handle. |
| `Derivative` | Derivative of f, a function handle. |
| `Inverse` | Inverse of f, a function handle. |

The link is a function f that links the distribution parameter μ to the fitted linear combination Xb of the predictors:

f(μ) = Xb.

### `LogLikelihood` — Log likelihood (numeric value)

Log likelihood of the model distribution at the response values, stored as a numeric value. The mean is fitted from the model, and other parameters are estimated as part of the model fit.

### `ModelCriterion` — Criterion for model comparison (structure)

Criterion for model comparison, stored as a structure with the following fields:

• `AIC` — Akaike information criterion

• `AICc` — Akaike information criterion corrected for sample size

• `BIC` — Bayesian information criterion

• `CAIC` — Consistent Akaike information criterion

To obtain any of these values as a scalar, index into the property using dot notation. For example, in a model `mdl`, the AIC value `aic` is:

`aic = mdl.ModelCriterion.AIC`
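As an illustrative sketch, the criteria can be used to compare two candidate models fitted to the same response. The simulated data and the two model choices are assumptions of this example, and a lower criterion value indicates the preferred model.

```matlab
rng default                                   % for reproducibility
X = randn(100,2);                             % simulated predictors (illustrative only)
y = poissrnd(exp(1 + 0.4*X(:,1)));            % response depends on the first column only

mdlSmall = fitglm(X(:,1),y,'linear','Distribution','poisson');
mdlFull  = fitglm(X,y,'linear','Distribution','poisson');

aicSmall = mdlSmall.ModelCriterion.AIC;
aicFull  = mdlFull.ModelCriterion.AIC;
bicFull  = mdlFull.ModelCriterion.BIC;        % BIC penalizes parameters more when log(n) > 2

if aicSmall < aicFull
    disp('AIC prefers the smaller model')
else
    disp('AIC prefers the larger model')
end
```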

### `NumCoefficients` — Number of model coefficients (positive integer)

Number of model coefficients, stored as a positive integer. `NumCoefficients` includes coefficients that are set to zero when the model terms are rank deficient.

### `NumEstimatedCoefficients` — Number of estimated coefficients (positive integer)

Number of estimated coefficients in the model, stored as a positive integer. `NumEstimatedCoefficients` does not include coefficients that are set to zero when the model terms are rank deficient. `NumEstimatedCoefficients` is the degrees of freedom for regression.

### `NumObservations` — Number of observations (positive integer)

Number of observations the fitting function used in fitting, stored as a positive integer. This is the number of observations supplied in the original table, dataset, or matrix, minus any excluded rows (set with the `Exclude` name-value pair) or rows with missing values.

### `NumPredictors` — Number of predictor variables (positive integer)

Number of predictor variables used to fit the model, stored as a positive integer.

### `NumVariables` — Number of variables (positive integer)

Number of variables in the input data, stored as a positive integer. `NumVariables` is the number of variables in the original table or dataset, or the total number of columns in the predictor matrix and response vector when the fit is based on those arrays. It includes variables, if any, that are not used as predictors or as the response.

### `ObservationInfo` — Observation information (table)

Observation information, stored as an n-by-4 table, where n is equal to the number of rows of input data. The four columns of `ObservationInfo` contain the following:

| Field | Description |
| --- | --- |
| `Weights` | Observation weights. Default is all `1`. |
| `Excluded` | Logical value, where `1` indicates an observation that you excluded from the fit with the `Exclude` name-value pair. |
| `Missing` | Logical value, where `1` indicates a missing value in the input. Missing values are not used in the fit. |
| `Subset` | Logical value, where `1` indicates that the observation is neither excluded nor missing, so it is used in the fit. |

### `ObservationNames` — Observation names (cell array)

Observation names, stored as a cell array of strings containing the names of the observations used in the fit.

• If the fit is based on a table or dataset containing observation names, `ObservationNames` uses those names.

• Otherwise, `ObservationNames` is an empty cell array.

### `Offset` — Offset variable (numeric vector)

Offset variable, stored as a numeric vector with the same length as the number of rows in the data. `Offset` is passed from `fitglm` or `stepwiseglm` in the `Offset` name-value pair. The fitting function used `Offset` as a predictor variable, but with the coefficient set to exactly `1`. In other words, the formula for fitting was

`μ ~ Offset + (terms involving real predictors)`

with the `Offset` predictor having coefficient `1`.

For example, consider a Poisson regression model. Suppose the number of counts is known for theoretical reasons to be proportional to a predictor `A`. By using the log link function and by specifying `log(A)` as an offset, you can force the model to satisfy this theoretical constraint.
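The Poisson example above can be sketched as follows. The exposure variable `A`, the simulated data, and the coefficient values are assumptions made for illustration. Passing `log(A)` as the offset fixes its coefficient at 1, so the fitted mean is proportional to `A`.

```matlab
rng default                                   % for reproducibility
A = 1 + 9*rand(200,1);                        % hypothetical exposure, known in advance
x = randn(200,1);                             % one ordinary predictor
y = poissrnd(A .* exp(0.2 + 0.5*x));          % counts proportional to A

% Log link is the Poisson default, so log(A) enters the linear predictor.
mdl = fitglm(x,y,'linear','Distribution','poisson','Offset',log(A));

slope = mdl.Coefficients.Estimate(2);         % estimate for x, simulated true value 0.5
offsetDiff = max(abs(mdl.Offset - log(A)));   % stored offset matches the input
```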

### `PredictorNames` — Names of predictors used to fit the model (cell array)

Names of predictors used to fit the model, stored as a cell array of strings.

### `Residuals` — Residuals for fitted model (table)

Residuals for the fitted model, stored as a table with one row for each observation and the following columns.

| Field | Description |
| --- | --- |
| `Raw` | Observed minus fitted values. |
| `LinearPredictor` | Residuals on the linear predictor scale, equal to the adjusted response value minus the fitted linear combination of the predictors. |
| `Pearson` | Raw residuals divided by the estimated standard deviation of the response. |
| `Anscombe` | Residuals defined on transformed data with the transformation chosen to remove skewness. |
| `Deviance` | Residuals based on the contribution of each observation to the deviance. |

To obtain any of these columns as a vector, index into the property using dot notation. For example, in a model `mdl`, the ordinary raw residual vector `r` is:

`r = mdl.Residuals.Raw`
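As a hedged sketch of how the residual types relate, the following uses a Poisson fit to simulated data (an assumption of this example). For the Poisson distribution the variance function is V(μ) = μ, so the Pearson residuals should equal the raw residuals divided by the square root of the fitted means, and the squared deviance residuals should sum to the model deviance, provided the definitions above apply as stated.

```matlab
rng default                                   % for reproducibility
X = randn(100,2);                             % simulated predictors (illustrative only)
y = poissrnd(exp(1 + X*[0.4; 0.3]));
mdl = fitglm(X,y,'linear','Distribution','poisson');

r  = mdl.Residuals.Raw;                       % observed minus fitted
mu = mdl.Fitted.Response;                     % fitted means
rawDiff = max(abs(r - (y - mu)));

pearsonByHand = r ./ sqrt(mu);                % Poisson: V(mu) = mu, dispersion 1
pearsonDiff = max(abs(pearsonByHand - mdl.Residuals.Pearson));

devDiff = abs(sum(mdl.Residuals.Deviance.^2) - mdl.Deviance);
```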

Rows not used in the fit because of missing values (in `ObservationInfo.Missing`) contain `NaN` values.

Rows not used in the fit because of excluded values (in `ObservationInfo.Excluded`) contain `NaN` values, with the following exceptions:

• `raw` contains the difference between the observed and predicted values.

• `standardized` is the residual, standardized in the usual way.

• `studentized` matches the standardized values because this residual is not used in the estimate of the residual standard deviation.

### `ResponseName` — Response variable name (string)

Response variable name, stored as a string.

### `Rsquared` — R-squared value for the model (structure)

R-squared value for the model, stored as a structure.

For a linear or nonlinear model, `Rsquared` is a structure with two fields:

• `Ordinary` — Ordinary (unadjusted) R-squared

• `Adjusted` — R-squared adjusted for the number of coefficients

For a generalized linear model, `Rsquared` is a structure with five fields:

• `Ordinary` — Ordinary (unadjusted) R-squared

• `Adjusted` — R-squared adjusted for the number of coefficients

• `LLR` — Log-likelihood ratio

• `Deviance` — Deviance

• `AdjGeneralized` — Adjusted generalized R-squared

The R-squared value is the proportion of total sum of squares explained by the model. The ordinary R-squared value relates to the `SSR` and `SST` properties:

`Rsquared = SSR/SST = 1 - SSE/SST`.

To obtain any of these values as a scalar, index into the property using dot notation. For example, the adjusted R-squared value in `mdl` is

`r2 = mdl.Rsquared.Adjusted`

### `SSE` — Sum of squared errors (numeric value)

Sum of squared errors (residuals), stored as a numeric value.

The Pythagorean theorem implies

`SST = SSE + SSR`.

### `SSR` — Regression sum of squares (numeric value)

Regression sum of squares, stored as a numeric value. The regression sum of squares is equal to the sum of squared deviations of the fitted values from their mean.

The Pythagorean theorem implies

`SST = SSE + SSR`.

### `SST` — Total sum of squares (numeric value)

Total sum of squares, stored as a numeric value. The total sum of squares is equal to the sum of squared deviations of `y` from `mean(y)`.

The Pythagorean theorem implies

`SST = SSE + SSR`.

### `Steps` — Stepwise fitting information (structure)

Stepwise fitting information, stored as a structure with the following fields.

| Field | Description |
| --- | --- |
| `Start` | Formula representing the starting model |
| `Lower` | Formula representing the lower bound model; these terms must remain in the model |
| `Upper` | Formula representing the upper bound model; the model cannot contain more terms than `Upper` |
| `Criterion` | Criterion used for the stepwise algorithm, such as `'sse'` |
| `PEnter` | Value of the parameter, such as `0.05` |
| `PRemove` | Value of the parameter, such as `0.10` |
| `History` | Table representing the steps taken in the fit |

The `History` table has one row for each step including the initial fit, and the following variables (columns).

| Field | Description |
| --- | --- |
| `Action` | Action taken during this step, one of `'Start'` (first step), `'Add'` (a term is added), or `'Remove'` (a term is removed) |
| `TermName` | For the `'Start'` step, the starting model specification. For `'Add'` or `'Remove'` steps, the term moved in that step |
| `Terms` | Terms matrix (see `modelspec` of `fitlm`) |
| `DF` | Regression degrees of freedom after this step |
| `delDF` | Change in regression degrees of freedom from previous step (negative for steps that remove a term) |
| `Deviance` | Deviance (residual sum of squares) at that step |
| `FStat` | F statistic that led to this step |
| `PValue` | p-value of the F statistic |

The structure is empty unless you use `stepwiselm` or `stepwiseglm` to fit the model.

### `VariableInfo` — Information about input variables (table)

Information about input variables contained in `Variables`, stored as a table with one row for each variable and the following columns.

| Field | Description |
| --- | --- |
| `Class` | String giving variable class, such as `'double'` |
| `Range` | Cell array giving variable range. For a continuous variable, a two-element vector `[min,max]` of the minimum and maximum values. For a categorical variable, a cell array of distinct variable values |
| `InModel` | Logical vector, where `true` indicates the variable is in the model |
| `IsCategorical` | Logical vector, where `true` indicates a categorical variable |

### `VariableNames` — Names of variables used in fit (cell array)

Names of variables used in fit, stored as a cell array of strings.

• If the fit is based on a table or dataset, this property provides the names of the variables in that table or dataset.

• If the fit is based on a predictor matrix and response vector, `VariableNames` is the values in the `VarNames` name-value pair of the fitting method.

• Otherwise the variables have the default fitting names.

### `Variables` — Data used to fit the model (table)

Data used to fit the model, stored as a table. `Variables` contains both observation and response values. If the fit is based on a table or dataset array, `Variables` contains all of the data from that table or dataset array. Otherwise, `Variables` is a table created from the input data matrix `X` and response vector `y`.

## Methods

| Method | Description |
| --- | --- |
| `addTerms` | Add terms to generalized linear model |
| `coefCI` | Confidence intervals of coefficient estimates of generalized linear model |
| `coefTest` | Linear hypothesis test on generalized linear regression model coefficients |
| `devianceTest` | Analysis of deviance |
| `disp` | Display generalized linear regression model |
| `feval` | Evaluate generalized linear regression model prediction |
| `fit` | Create generalized linear regression model |
| `plotDiagnostics` | Plot diagnostics of generalized linear regression model |
| `plotResiduals` | Plot residuals of generalized linear regression model |
| `plotSlice` | Plot of slices through fitted generalized linear regression surface |
| `predict` | Predict response of generalized linear regression model |
| `random` | Simulate responses for generalized linear regression model |
| `removeTerms` | Remove terms from generalized linear model |
| `step` | Improve generalized linear regression model by adding or removing terms |
| `stepwise` | Create generalized linear regression model by stepwise regression |

## Definitions

The default link function for a generalized linear model is the canonical link function.

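Canonical Link Functions for Generalized Linear Models

| Distribution | Link function name | Link function | Mean (inverse) function |
| --- | --- | --- | --- |
| `'normal'` | `'identity'` | $f(\mu) = \mu$ | $\mu = Xb$ |
| `'binomial'` | `'logit'` | $f(\mu) = \log(\mu/(1-\mu))$ | $\mu = \exp(Xb)/(1 + \exp(Xb))$ |
| `'poisson'` | `'log'` | $f(\mu) = \log(\mu)$ | $\mu = \exp(Xb)$ |
| `'gamma'` | `-1` | $f(\mu) = 1/\mu$ | $\mu = 1/(Xb)$ |
| `'inverse gaussian'` | `-2` | $f(\mu) = 1/\mu^{2}$ | $\mu = (Xb)^{-1/2}$ |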
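The link and mean functions in the table above can be written as anonymous functions and checked for consistency. This is a sketch with hypothetical handle names, not the handles stored in the model object: applying each mean function to the output of its link should return the original μ.

```matlab
% Link functions f and their inverses, one pair per canonical link
logitLink = @(mu) log(mu./(1-mu));   logitInv = @(eta) exp(eta)./(1+exp(eta));
logLink   = @(mu) log(mu);           logInv   = @(eta) exp(eta);
recipLink = @(mu) 1./mu;             recipInv = @(eta) 1./eta;
invSqLink = @(mu) 1./mu.^2;          invSqInv = @(eta) eta.^(-1/2);

muProb = [0.1 0.5 0.9];              % means in (0,1) for the logit link
muPos  = [0.5 2 10];                 % positive means for the other links

% Round-trip errors: inverse(link(mu)) - mu
errLogit = max(abs(logitInv(logitLink(muProb)) - muProb));
errLog   = max(abs(logInv(logLink(muPos)) - muPos));
errRecip = max(abs(recipInv(recipLink(muPos)) - muPos));
errInvSq = max(abs(invSqInv(invSqLink(muPos)) - muPos));
```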

### Hat Matrix

The hat matrix $H$ is defined in terms of the data matrix $X$ and a diagonal weight matrix $W$:

$$H = X\left(X^{T}WX\right)^{-1}X^{T}W^{T}.$$

$W$ has diagonal elements $w_i$:

$$w_i = \frac{g'(\mu_i)}{\sqrt{V(\mu_i)}},$$

where

• $g$ is the link function mapping $y_i$ to $x_i b$.

• $g'$ is the derivative of the link function $g$.

• $V$ is the variance function.

• $\mu_i$ is the $i$th mean.

The diagonal elements $h_{ii}$ satisfy

$$0 \le h_{ii} \le 1, \qquad \sum_{i=1}^{n} h_{ii} = p,$$

where n is the number of observations (rows of X), and p is the number of coefficients in the regression model.
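The two properties of the diagonal elements can be checked numerically. The sketch below builds the hat matrix by hand for the simplest case, assuming a normal model with identity link so that W is the identity matrix. The data and dimensions are illustrative.

```matlab
rng default                          % for reproducibility
n = 20;                              % observations
p = 3;                               % coefficients, including the constant
X = [ones(n,1) randn(n,p-1)];        % design matrix with an explicit constant column
W = eye(n);                          % identity link, unit variance: w_i = 1

H = X * ((X'*W*X) \ (X'*W'));        % H = X (X' W X)^(-1) X' W'
h = diag(H);                         % leverage values h_ii

traceH = sum(h);                     % should equal p
inRange = all(h >= -1e-12 & h <= 1 + 1e-12);
```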

### Leverage

The leverage of observation $i$ is the value of the $i$th diagonal term, $h_{ii}$, of the hat matrix $H$. Because the sum of the leverage values is $p$ (the number of coefficients in the regression model), an observation $i$ can be considered to be an outlier if its leverage substantially exceeds $p/n$, where $n$ is the number of observations.

### Cook's Distance

The Cook's distance $D_i$ of observation $i$ is

$$D_i = w_i\,\frac{e_i^{2}}{p\,\hat{\phi}}\,\frac{h_{ii}}{\left(1-h_{ii}\right)^{2}},$$

where

• $\hat{\phi}$ is the dispersion parameter (estimated or theoretical).

• $e_i$ is the linear predictor residual, $g(y_i) - x_i\hat{\beta}$, where $g$ is the link function, $y_i$ is the observed response, $x_i$ is the observation, and $\hat{\beta}$ is the estimated coefficient vector.

• $p$ is the number of coefficients in the regression model.

• $h_{ii}$ is the $i$th diagonal element of the Hat Matrix $H$.

### Deviance

Deviance of a model $M_1$ is twice the difference between the loglikelihood of the saturated model, $M_S$, and the loglikelihood of $M_1$. The saturated model is the model with the maximum number of parameters that can be estimated. For example, if there are $n$ observations $y_i$, $i = 1, 2, \ldots, n$, with potentially different values for $X_i^{T}\beta$, then you can define a saturated model with $n$ parameters. Let $L(b,y)$ denote the maximum value of the likelihood function for a model. Then the deviance of model $M_1$ is

$$-2\left(\log L(b_1,y) - \log L(b_S,y)\right),$$

where $b_1$ are the estimated parameters for model $M_1$ and $b_S$ are the estimated parameters for the saturated model. The deviance has a chi-square distribution with $n - p$ degrees of freedom, where $n$ is the number of parameters in the saturated model and $p$ is the number of parameters in model $M_1$.

If $M_1$ and $M_2$ are two different generalized linear models, then the fit of the models can be assessed by comparing the deviances $D_1$ and $D_2$ of these models. The difference of the deviances is

$$\begin{aligned} D = D_2 - D_1 &= -2\left(\log L(b_2,y) - \log L(b_S,y)\right) + 2\left(\log L(b_1,y) - \log L(b_S,y)\right) \\ &= -2\left(\log L(b_2,y) - \log L(b_1,y)\right). \end{aligned}$$

Asymptotically, this difference has a chi-square distribution with degrees of freedom $v$ equal to the number of parameters that are estimated in one model but fixed (typically at 0) in the other. That is, it is equal to the difference in the number of parameters estimated in $M_1$ and $M_2$. You can get the p-value for this test using `1 - chi2cdf(D,v)`, where $D = D_2 - D_1$.
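The deviance comparison can be sketched as follows, using simulated data that is an assumption of this example. Here $M_1$ is taken to be the larger model and $M_2$ the reduced model nested in it, so that $D = D_2 - D_1$ is nonnegative.

```matlab
rng default                                   % for reproducibility
X = randn(100,3);                             % simulated predictors (illustrative only)
y = poissrnd(exp(1 + X*[0.4; 0.2; 0]));       % third predictor has no true effect

mdl1 = fitglm(X,y,'linear','Distribution','poisson');         % larger model M1
mdl2 = fitglm(X(:,1:2),y,'linear','Distribution','poisson');  % reduced model M2

D = mdl2.Deviance - mdl1.Deviance;            % D = D2 - D1
v = mdl1.NumEstimatedCoefficients - mdl2.NumEstimatedCoefficients;
pval = 1 - chi2cdf(D,v);                      % small values favor the larger model
```

A large p-value here suggests that the extra term in the larger model does not improve the fit beyond what chance would explain.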

## Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects in the MATLAB® documentation.

## Examples


### Fit a Generalized Linear Model

Fit a logistic regression model of probability of smoking as a function of age, weight, and sex, using a two-way interactions model.

Load the `hospital` dataset array.

```
load hospital
ds = hospital; % just to use the ds name
```

Specify the model using a formula that allows up to two-way interactions.

```modelspec = 'Smoker ~ Age*Weight*Sex - Age:Weight:Sex'; ```

Create the generalized linear model.

```mdl = fitglm(ds,modelspec,'Distribution','binomial') ```
```
mdl = 

Generalized Linear regression model:
    logit(Smoker) ~ 1 + Sex*Age + Sex*Weight + Age*Weight
    Distribution = Binomial

Estimated Coefficients:
                        Estimate         SE          tStat       pValue 
                       ___________    _________    ________    _______
    (Intercept)            -6.0492       19.749     -0.3063    0.75938
    Sex_Male               -2.2859       12.424    -0.18399    0.85402
    Age                    0.11691      0.50977     0.22934    0.81861
    Weight                0.031109      0.15208     0.20455    0.83792
    Sex_Male:Age          0.020734      0.20681     0.10025    0.92014
    Sex_Male:Weight        0.01216     0.053168     0.22871     0.8191
    Age:Weight         -0.00071959    0.0038964    -0.18468    0.85348

100 observations, 93 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 5.07, p-value = 0.535
```

The large p-value indicates the model might not differ statistically from a constant.

### Create a Generalized Linear Model Stepwise

Create response data using just three of 20 predictors, and create a generalized linear model stepwise to see if it uses just the correct predictors.

Create data with 20 predictors, and Poisson response using just three of the predictors, plus a constant.

```
rng default % for reproducibility
X = randn(100,20);
mu = exp(X(:,[5 10 15])*[.4;.2;.3] + 1);
y = poissrnd(mu);
```

Fit a generalized linear model using the Poisson distribution.

```
mdl = stepwiseglm(X,y,...
    'constant','upper','linear','Distribution','poisson')
```

```
1. Adding x5, Deviance = 134.439, Chi2Stat = 52.24814, PValue = 4.891229e-13
2. Adding x15, Deviance = 106.285, Chi2Stat = 28.15393, PValue = 1.1204e-07
3. Adding x10, Deviance = 95.0207, Chi2Stat = 11.2644, PValue = 0.000790094

mdl = 

Generalized Linear regression model:
    log(y) ~ 1 + x5 + x10 + x15
    Distribution = Poisson

Estimated Coefficients:
                   Estimate       SE         tStat       pValue  
                   ________    ________    ______    __________
    (Intercept)      1.0115    0.064275    15.737    8.4217e-56
    x5              0.39508    0.066665    5.9263    3.0977e-09
    x10             0.18863     0.05534    3.4085     0.0006532
    x15             0.29295    0.053269    5.4995    3.8089e-08

100 observations, 96 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 91.7, p-value = 9.61e-20
```