Main Content

# glmfit

Fit generalized linear regression model

## Syntax

``b = glmfit(X,y,distr)``
``b = glmfit(X,y,distr,Name,Value)``
``[b,dev] = glmfit(___)``
``[b,dev,stats] = glmfit(___)``

## Description

````b = glmfit(X,y,distr)` returns a vector `b` of coefficient estimates for a generalized linear regression model of the responses in `y` on the predictors in `X`, using the distribution `distr`.```

example

````b = glmfit(X,y,distr,Name,Value)` specifies additional options using one or more name-value arguments. For example, you can specify `'Constant','off'` to omit the constant term from the model.```

example

````[b,dev] = glmfit(___)` also returns the value `dev`, the deviance of the fit.```
````[b,dev,stats] = glmfit(___)` also returns the model statistics `stats`. ```

## Examples

collapse all

Fit a generalized linear regression model, and compute predicted (estimated) values for the predictor data using the fitted model.

Create a sample data set.

```x = [2100 2300 2500 2700 2900 3100 ... 3300 3500 3700 3900 4100 4300]'; n = [48 42 31 34 31 21 23 23 21 16 17 21]'; y = [1 2 0 3 8 8 14 17 19 15 17 21]';```

`x` contains the predictor variable values. Each `y` value is the number of successes in the corresponding number of trials in `n`.

Fit a probit regression model for `y` on `x`.

`b = glmfit(x,[y n],'binomial','Link','probit');`

Compute the estimated number of successes.

`yfit = glmval(b,x,'probit','Size',n);`

Plot the observed success percent and estimated success percent versus the `x` values.

`plot(x,y./n,'o',x,yfit./n,'-')` Define a custom link function and use it to fit a generalized linear regression model.

Load the sample data.

`load fisheriris`

The column vector `species` contains iris flowers of three different species: setosa, versicolor, and virginica. The matrix `meas` contains four types of measurements for the flowers, the length and width of sepals and petals in centimeters.

Define the predictor variables and response variable.

```X = meas(51:end,:); y = strcmp('versicolor',species(51:end));```

Define a custom link function for a logit link function. Create three function handles that define the link function, the derivative of the link function, and the inverse link function. Store them in a cell array.

```link = @(mu) log(mu./(1-mu)); derlink = @(mu) 1./(mu.*(1-mu)); invlink = @(resp) 1./(1+exp(-resp)); F = {link,derlink,invlink};```

Fit a logistic regression model using `glmfit` with the custom link function.

`b = glmfit(X,y,'binomial','link',F)`
```b = 5×1 42.6378 2.4652 6.6809 -9.4294 -18.2861 ```

Fit a generalized linear model by using the built-in `logit` link function, and compare the results.

`b = glmfit(X,y,'binomial','link','logit')`
```b = 5×1 42.6378 2.4652 6.6809 -9.4294 -18.2861 ```

Fit a generalized linear regression model that contains an intercept and linear term for each predictor. Perform a deviance test that determines whether the model fits significantly better than a constant model.

Generate sample data using Poisson random numbers with two underlying predictors `X(:,1)` and `X(:,2)`.

```rng('default') % For reproducibility rndvars = randn(100,2); X = [2 + rndvars(:,1),rndvars(:,2)]; mu = exp(1 + X*[1;2]); y = poissrnd(mu);```

Fit a generalized linear regression model that contains an intercept and linear term for each predictor.

`[b,dev] = glmfit(X,y,'poisson');`

The second output argument `dev` is a Deviance of the fit.

Fit a generalized linear regression model that contains only an intercept. Specify the predictor variable as a column of 1s, and specify `'Constant'` as `'off'` so that `glmfit` does not include a constant term in the model.

`[~,dev_noconstant] = glmfit(ones(100,1),y,'poisson','Constant','off');`

Compute the difference between `dev_constant` and `dev`.

`D = dev_noconstant - dev`
```D = 2.9533e+05 ```

`D` has a chi-square distribution with 2 degrees of freedom. The degrees of freedom equal the difference in the number of estimated parameters in the model corresponding to `dev` and the number of estimated parameters in the constant model. Find the p-value for a deviance test.

`p = 1 - chi2cdf(D,2)`
```p = 0 ```

The small p-value indicates that the model differs significantly from a constant.

Alternatively, you can create a generalized linear regression model of Poisson data by using the `fitglm` function. The model display includes the statistic (`Chi^2-statistic vs. constant model`) and p-value.

`mdl = fitglm(X,y,'y ~ x1 + x2','Distribution','poisson')`
```mdl = Generalized linear regression model: log(y) ~ 1 + x1 + x2 Distribution = Poisson Estimated Coefficients: Estimate SE tStat pValue ________ _________ ______ ______ (Intercept) 1.0405 0.022122 47.034 0 x1 0.9968 0.003362 296.49 0 x2 1.987 0.0063433 313.24 0 100 observations, 97 error degrees of freedom Dispersion: 1 Chi^2-statistic vs. constant model: 2.95e+05, p-value = 0 ```

You can also use the `devianceTest` function with the fitted model object.

`devianceTest(mdl)`
```ans=2×4 table Deviance DFE chi2Stat pValue __________ ___ __________ ______ log(y) ~ 1 2.9544e+05 99 log(y) ~ 1 + x1 + x2 107.4 97 2.9533e+05 0 ```

## Input Arguments

collapse all

Predictor variables, specified as an n-by-p numeric matrix, where n is the number of observations and p is the number of predictor variables. Each column of `X` represents one variable, and each row represents one observation.

By default, `glmfit` includes a constant term in the model. Do not add a column of 1s directly to `X`. You can change the default behavior of `glmfit` by specifying the `'Constant'` name-value argument.

Data Types: `single` | `double`

Response variable, specified as a vector or matrix.

• If `distr` is not `'binomial'`, then `y` must be an n-by-1 vector, where n is the number of observations. Each entry in `y` is the response for the corresponding row of `X`. The data type must be single or double.

• If `distr` is `'binomial'`, then `y` is an n-by-1 vector indicating success or failure at each observation, or an n-by-2 matrix whose first column indicates the number of successes for each observation and second column indicates the number of trials for each observation.

Data Types: `single` | `double` | `logical` | `categorical`

Distribution of the response variable, specified as one of the values in this table.

ValueDescription
`'normal'`Normal distribution (default)
`'binomial'`Binomial distribution
`'poisson'`Poisson distribution
`'gamma'`Gamma distribution
`'inverse gaussian'`Inverse Gaussian distribution

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `b = glmfit(X,y,'normal','link','probit')` specifies that the distribution of the response is normal and instructs `glmfit` to use the probit link function.

Initial values for the coefficient estimates, specified as a numeric vector. The default values are initial fitted values derived from the input data.

Data Types: `single` | `double`

Indicator for the constant term (intercept) in the fit, specified as either `'on'` to include the constant term or `'off'` to remove it from the model.

• `'on'` (default) — `glmfit` includes a constant term in the model and returns a (p + 1)-by-1 vector of coefficient estimates `b`, where p is the number of predictors in `X`. The coefficient of the constant term is the first element of `b`.

• `'off'``glmfit` omits the constant term and returns a p-by-1 vector of coefficient estimates `b`.

Example: `'Constant','off'`

Indicator to compute a dispersion parameter for `'binomial'` and `'poisson'` distributions, specified as `'on'` or `'off'`.

ValueDescription
`'on'`Estimate a dispersion parameter when computing standard errors. The estimated dispersion parameter value is the sum of squared Pearson residuals divided by the degrees of freedom for error (DFE).
`'off'`Use the theoretical value of 1 when computing standard errors (default).

The fitting function always estimates the dispersion for other distributions.

Example: `'EstDisp','on'`

Link function to use in place of the canonical link function, specified as one of the built-in link functions in the following table or a custom link function.

Link Function NameLink FunctionMean (Inverse) Function
`'identity'` (default for `'normal'` distribution)f(μ) = μμ = Xb
`'log'` (default for `'poisson'` distribution)f(μ) = log(μ)μ = exp(Xb)
`'logit'` (default for `'binomial'` distribution)f(μ) = log(μ/(1 – μ))μ = exp(Xb) / (1 + exp(Xb))
`'probit'`f(μ) = Φ–1(μ), where Φ is the cumulative distribution function of the standard normal distributionμ = Φ(Xb)

`'loglog'`

f(μ) = log(–log(μ))μ = exp(–exp(Xb))
`'comploglog'`f(μ) = log(–log(1 – μ))μ = 1 – exp(–exp(Xb))
`'reciprocal'` (default for `'gamma'` distribution)f(μ) = 1/μμ = 1/(Xb)
`p` (a number, default for the ```'inverse gaussian'``` distribution with p = –2)f(μ) = μpμ = Xb1/p

The default `'Link'` value is the canonical link function, which depends on the distribution of the response variable specified by the `distr` argument.

You can specify a custom link function using a structure or cell array.

• Structure with three fields. Each field of the structure (for example, `S`) holds a function handle that accepts a vector of inputs and returns a vector of the same size:

• `S.Link` — Link function, f(μ) = `S.Link`(μ)

• `S.Derivative` — Derivative of the link function

• `S.Inverse` — Inverse link function, μ = `S.Inverse`(Xb)

• Cell array of the form `{FL FD FI}` that defines the link function (`FL(mu)`), the derivative of the link function (`FD = dFL(mu)/dmu`), and the inverse link function (`FI = FL^(–1)`). Each entry holds a function handle that accepts a vector of inputs and returns a vector of the same size.

The link function defines the relationship f(μ) = X*b between the mean response μ and the linear combination of predictors X*b.

Example: `'Link','probit'`

Data Types: `single` | `double` | `char` | `string` | `struct` | `cell`

Offset variable in the fit, specified as a numeric vector with the same length as the response `y`.

`glmfit` uses `Offset` as an additional predictor with a coefficient value fixed at 1. In other words, the formula for fitting is

f(μ)``` = Offset + X*b```,

where f is the link function, μ is the mean response, and X*b is the linear combination of predictors X. The `Offset` predictor has coefficient `1`.

For example, consider a Poisson regression model. Suppose, for theoretical reasons, the number of counts is to be proportional to a predictor `A`. By using the log link function and specifying `log(A)` as an offset, you can force the model to satisfy this theoretical constraint.

Data Types: `single` | `double`

Optimization options, specified as a structure. This argument determines the control parameters for the iterative algorithm that `glmfit` uses.

Create the `'Options'` value by using the function `statset` or by creating a structure array containing the fields and values described in this table.

Field NameValueDefault Value
`Display`

Amount of information displayed by the algorithm

• `'off'` — Displays no information

• `'final'` — Displays the final output

`'off'`
`MaxIter`

Maximum number of iterations allowed, specified as a positive integer

`100`
`TolX`

Termination tolerance for the parameters, specified as a positive scalar

`1e-6`

You can also enter `statset('glmfit')` in the Command Window to see the names and default values of the fields that `glmfit` accepts in the `'Options'` name-value argument.

Example: `'Options',statset('Display','final','MaxIter',1000)` specifies to display the final information of the iterative algorithm results, and change the maximum number of iterations allowed to 1000.

Data Types: `struct`

Observation weights, specified as an n-by-1 vector of nonnegative scalar values, where n is the number of observations.

Data Types: `single` | `double`

## Output Arguments

collapse all

Coefficient estimates, returned as a numeric vector.

• If `'Constant'` is `'on'` (default), then `glmfit` includes a constant term in the model and returns a (p + 1)-by-1 vector of coefficient estimates `b`, where p is the number of predictors in `X`. The coefficient of the constant term is the first element of `b`.

• If `'Constant'` is `'off'`, then `glmfit` omits the constant term and returns a p-by-1 vector of coefficient estimates `b`.

Deviance of the fit, returned as a numeric value. The deviance is useful for comparing two models when one model is a special case of the other model. The difference between the deviance of the two models has a chi-square distribution with degrees of freedom equal to the difference in the number of estimated parameters between the two models.

For more information, see Deviance.

Model statistics, returned as a structure with the following fields:

• `beta` — Coefficient estimates `b`

• `dfe` — Degrees of freedom for error

• `sfit` — Estimated dispersion parameter

• `s` — Theoretical or estimated dispersion parameter

• `estdisp` — 0 when `'EstDisp'` is `'off'` and 1 when `'EstDisp'` is `'on'`

• `covb` — Estimated covariance matrix for `b`

• `se` — Vector of standard errors of the coefficient estimates `b`

• `coeffcorr` — Correlation matrix for `b`

• `t`t statistics for `b`

• `p`p-values for `b`

• `resid` — Vector of residuals

• `residp` — Vector of Pearson residuals

• `residd` — Vector of deviance residuals

• `resida` — Vector of Anscombe residuals

If you estimate a dispersion parameter for the binomial or Poisson distribution, then `stats.s` is equal to `stats.sfit`. Also, the elements of `stats.se` differ by the factor `stats.s` from their theoretical values.

## More About

collapse all

### Deviance

Deviance is a generalization of the residual sum of squares. It measures the goodness of fit compared to a saturated model.

Deviance of a model M1 is twice the difference between the loglikelihood of the model M1 and the saturated model Ms. A saturated model is a model with the maximum number of parameters that you can estimate.

For example, if you have n observations (yi, i = 1, 2, ..., n) with potentially different values for XiTβ, then you can define a saturated model with n parameters. Let L(b,y) denote the maximum value of the likelihood function for a model with the parameters b. Then the deviance of the model M1 is

`$-2\left(\mathrm{log}L\left({b}_{1},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right),$`

where b1 and bs contain the estimated parameters for the model M1 and the saturated model, respectively. The deviance has a chi-square distribution with np degrees of freedom, where n is the number of parameters in the saturated model and p is the number of parameters in the model M1.

Assume you have two different generalized linear regression models M1 and M2, and M1 has a subset of the terms in M2. You can assess the fit of the models by comparing the deviances D1 and D2 of the two models. The difference of the deviances is

`$\begin{array}{l}D={D}_{2}-{D}_{1}=-2\left(\mathrm{log}L\left({b}_{2},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right)+2\left(\mathrm{log}L\left({b}_{1},y\right)-\mathrm{log}L\left({b}_{S},y\right)\right)\\ \text{ }\text{ }\text{ }\text{\hspace{0.17em}}\text{\hspace{0.17em}}=-2\left(\mathrm{log}L\left({b}_{2},y\right)-\mathrm{log}L\left({b}_{1},y\right)\right).\end{array}$`

Asymptotically, the difference D has a chi-square distribution with degrees of freedom v equal to the difference in the number of parameters estimated in M1 and M2. You can obtain the p-value for this test by using `1 – chi2cdf(D,v)`.

Typically, you examine D using a model M2 with a constant term and no predictors. Therefore, D has a chi-square distribution with p – 1 degrees of freedom. If the dispersion is estimated, the difference divided by the estimated dispersion has an F distribution with p – 1 numerator degrees of freedom and np denominator degrees of freedom.

## Alternative Functionality

`glmfit` is useful when you simply need the output arguments of the function or when you want to repeat fitting a model multiple times in a loop. If you need to investigate a fitted model further, create a generalized linear regression model object `GeneralizedLinearModel` by using `fitglm` or `stepwiseglm`. A `GeneralizedLinearModel` object provides more features than `glmfit`.

• Use the properties of `GeneralizedLinearModel` to investigate a fitted model. The object properties include information about the coefficient estimates, summary statistics, fitting method, and input data.

• Use the object functions of `GeneralizedLinearModel` to predict responses and to modify, evaluate, and visualize the generalized linear regression model.

• You can find the information in the output of `glmfit` using the properties and object functions of `GeneralizedLinearModel`.

Output of `glmfit`Equivalent Values in `GeneralizedLinearModel`
`b`See the `Estimate` column of the `Coefficients` property.
`dev`See the `Deviance` property.
`stats`

See the model display in the Command Window. You can find the statistics in the model properties (`CoefficientCovariance`, `Coefficients`, `Dispersion`, `DispersionEstimated`, and `Residuals`).

The dispersion parameter in `stats.s` of `glmfit` is the scale factor for the standard errors of coefficients, whereas the dispersion parameter in the `Dispersion` property of a generalized linear model is the scale factor for the variance of the response. Therefore, `stats.s` is the square root of the `Dispersion` value.

 Dobson, A. J. An Introduction to Generalized Linear Models. New York: Chapman & Hall, 1990.

 McCullagh, P., and J. A. Nelder. Generalized Linear Models. New York: Chapman & Hall, 1990.

 Collett, D. Modeling Binary Data. New York: Chapman & Hall, 2002.

## Support

#### Mastering Machine Learning: A Step-by-Step Guide with MATLAB

Download ebook