Wilkinson Notation

Overview

Wilkinson notation provides a way to describe regression and repeated measures models without specifying coefficient values. This specialized notation identifies the response variable and which predictor variables to include or exclude from the model. You can also include squared and higher-order terms, interaction terms, and grouping variables in the model formula.

Specifying a model using Wilkinson notation provides several advantages:

You can include or exclude individual predictors and interaction terms from the model. For example, using the 'Interactions' name-value pair available in each model fitting function includes interaction terms for all pairs of variables. Using Wilkinson notation instead allows you to include only the interaction terms of interest.
You can change the model formula without changing the design matrix, if your input data uses the table data type. For example, if you fit an initial model using all the available predictor variables, but decide to remove a variable that is not statistically significant, then you can rewrite the model formula to include only the variables of interest. You do not need to make any changes to the input data itself.
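
As a minimal sketch of the second point, the following code fits two models to the same table by changing only the formula. It assumes the carsmall sample data set and the fitlm function from Statistics and Machine Learning Toolbox; the names mdlFull and mdlReduced are illustrative only.

```matlab
% Sketch: refit with a different formula; the table tbl is never modified.
load carsmall                            % sample data shipped with the toolbox
tbl = table(Weight,Acceleration,MPG);    % table variable names follow the workspace names

mdlFull    = fitlm(tbl,'MPG ~ Weight + Acceleration');  % initial model
mdlReduced = fitlm(tbl,'MPG ~ Weight');                 % Acceleration dropped via the formula only

% List the terms that each formula produced.
disp(mdlFull.CoefficientNames)
disp(mdlReduced.CoefficientNames)
```

Both calls read from the same tbl, so the only difference between the two fits is the formula.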

Statistics and Machine Learning Toolbox™ offers several model fitting functions that use Wilkinson notation, including:

Linear models (using fitlm and stepwiselm)
Generalized linear models (using fitglm and stepwiseglm)
Linear mixed-effects models (using fitlme and fitlmematrix)
Generalized linear mixed-effects models (using fitglme)
Repeated measures models (using fitrm)
Cox proportional hazards model (using fitcox)

Formula Specification

A formula for model specification is a character vector or string scalar of the form y ~ terms, where y is the name of the response variable, and terms defines the model using the predictor variable names and the following operators.

Predictor Variables

Predictor Terms in Model	Wilkinson Notation
intercept	`1`
no intercept	`-1`
x₁	`x1`
x₁, x₂	`x1 + x2`
x₁, x₂, x₁x₂	`x1*x2` or `x1 + x2 + x1:x2`
x₁x₂	`x1:x2`
x₁, x₁²	`x1^2`
x₁²	`x1^2 - x1`

Wilkinson notation includes an intercept term in the model by default, even if you do not add 1 to the model formula. To exclude the intercept from the model, use -1 in the formula.

The * operator (for interactions) and the ^ operator (for power and exponents) automatically include all lower-order terms. For example, if you specify x^3, the model will automatically include x³, x², and x. If you want to exclude certain variables from the model, use the - operator to remove the unwanted terms.
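
The following sketch illustrates the default intercept, the -1 term, and the ^ and - operators by listing the terms that each formula produces. The synthetic data and the variable names (tbl, mdlDefault, and so on) are assumptions made for illustration; only fitlm and its CoefficientNames property come from the toolbox.

```matlab
% Sketch with synthetic data (hypothetical values, for illustration only).
rng(1)                                   % reproducible random numbers
x1  = randn(50,1);
y   = 2 + 0.5*x1 - 0.3*x1.^2 + 0.1*randn(50,1);
tbl = table(x1,y);

mdlDefault = fitlm(tbl,'y ~ x1');        % intercept is included by default
mdlNoInt   = fitlm(tbl,'y ~ -1 + x1');   % -1 removes the intercept
mdlQuad    = fitlm(tbl,'y ~ x1^2');      % ^ also brings in the lower-order term x1
mdlSqOnly  = fitlm(tbl,'y ~ x1^2 - x1'); % - removes the linear term again

% Display the terms in each fitted model.
disp(mdlDefault.CoefficientNames)
disp(mdlNoInt.CoefficientNames)
disp(mdlQuad.CoefficientNames)
disp(mdlSqOnly.CoefficientNames)
```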

Random-Effects and Mixed-Effects Models

For random-effects and mixed-effects models, the formula specification includes the names of the predictor variables and the grouping variables. For example, if the predictor variable x₁ is a random effect grouped by the variable g, then represent this in Wilkinson notation as follows:

(x1 | g)

Repeated Measures Models

For repeated measures models, the formula specification includes all of the repeated measures as responses, and the factors as predictor variables. Specify the response variables for repeated measures models as described in the following table.

Response Terms in Model	Wilkinson Notation
y₁	`y1`
y₁, y₂, y₃	`y1,y2,y3`
y₁, y₂, y₃, y₄, y₅	`y1-y5`

For example, if you have three repeated measures as responses and the factors x₁, x₂, and x₃ as the predictor variables, then you can define the repeated measures model using either of the following equivalent Wilkinson notation formulas:

y1,y2,y3 ~ x1 + x2 + x3

y1-y3 ~ x1 + x2 + x3

Variable Names

If the input data (response and predictor variables) is stored in a table or dataset array, you can specify the formula using the variable names. For example, load the carsmall sample data. Create a table containing Weight, Acceleration, and MPG, naming each variable using the 'VariableNames' name-value pair argument of the table function. Then use fitlm to fit the following model to the data:

$\mathrm{MPG} = β_{0} + β_{1} \mathrm{Weight} + β_{2} \mathrm{Acceleration}$

load carsmall
tbl = table(Weight,Acceleration,MPG, ...
    'VariableNames',{'Weight','Acceleration','MPG'});
mdl = fitlm(tbl,'MPG ~ Weight + Acceleration')

mdl = 


Linear regression model:
    MPG ~ 1 + Weight + Acceleration

Estimated Coefficients:
                     Estimate         SE         tStat       pValue  
                    __________    __________    _______    __________

    (Intercept)         45.155        3.4659     13.028    1.6266e-22
    Weight          -0.0082475    0.00059836    -13.783    5.3165e-24
    Acceleration       0.19694       0.14743     1.3359       0.18493


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743,  Adjusted R-Squared: 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27

The model object display uses the variable names provided in the input table.

If the input data is stored as a matrix, you can specify the formula using default variable names such as y, x1, and x2. For example, load the carsmall sample data. Create a matrix containing the predictor variables Weight and Acceleration. Then fit the following model to the data:

$\mathrm{MPG} = β_{0} + β_{1} \mathrm{Weight} + β_{2} \mathrm{Acceleration}$

load carsmall
X = [Weight,Acceleration];
y = MPG;
mdl = fitlm(X,y,'y ~ x1 + x2')

mdl = 


Linear regression model:
    y ~ 1 + x1 + x2

Estimated Coefficients:
                    Estimate         SE         tStat       pValue  
                   __________    __________    _______    __________

    (Intercept)        45.155        3.4659     13.028    1.6266e-22
    x1             -0.0082475    0.00059836    -13.783    5.3165e-24
    x2                0.19694       0.14743     1.3359       0.18493


Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 4.12
R-squared: 0.743,  Adjusted R-Squared: 0.738
F-statistic vs. constant model: 132, p-value = 1.38e-27

The term x1 in the model specification formula corresponds to the first column of the predictor variable matrix X. The term x2 corresponds to the second column of the input matrix. The term y corresponds to the response variable.

Linear Model Examples

Use fitlm and stepwiselm to fit linear models.

Intercept and Two Predictors

For a linear regression model with an intercept and two fixed-effects predictors, such as

$y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + ε_{i},$

specify the model formula using Wilkinson notation as follows:

'y ~ x1 + x2'

No Intercept and Two Predictors

For a linear regression model with no intercept and two fixed-effects predictors, such as

$y_{i} = β_{1} x_{i 1} + β_{2} x_{i 2} + ε_{i},$

specify the model formula using Wilkinson notation as follows:

'y ~ -1 + x1 + x2'

Intercept, Two Predictors, and an Interaction Term

For a linear regression model with an intercept, two fixed-effects predictors, and an interaction term, such as

$y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + β_{3} x_{i 1} x_{i 2} + ε_{i},$

specify the model formula using either of the following equivalent Wilkinson notation formulas:

'y ~ x1*x2'

'y ~ x1 + x2 + x1:x2'
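
As a quick check that the two formulas describe the same model, the following sketch fits both to the same synthetic data and compares the results. The data and the names mdlStar and mdlColon are hypothetical; the predictor matrix uses the default names x1 and x2.

```matlab
% Sketch with synthetic data (hypothetical values, for illustration only).
rng(2)
X = randn(40,2);                                   % columns become x1 and x2
y = 1 + X(:,1) - 2*X(:,2) + 0.5*X(:,1).*X(:,2) + 0.1*randn(40,1);

mdlStar  = fitlm(X,y,'y ~ x1*x2');                 % shorthand form
mdlColon = fitlm(X,y,'y ~ x1 + x2 + x1:x2');       % expanded form

% Both models contain the same number of terms and fit equally well.
disp(mdlStar.CoefficientNames)
disp(mdlColon.CoefficientNames)
fprintf('RMSE difference: %g\n', abs(mdlStar.RMSE - mdlColon.RMSE))
```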

Intercept, Three Predictors, and All Interaction Effects

For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between all three predictors plus all lower-order terms, such as

$y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + β_{3} x_{i 3} + β_{4} x_{i 1} x_{i 2} + β_{5} x_{i 1} x_{i 3} + β_{6} x_{i 2} x_{i 3} + β_{7} x_{i 1} x_{i 2} x_{i 3} + ε_{i},$

specify the model formula using Wilkinson notation as follows:

'y ~ x1*x2*x3'

Intercept, Three Predictors, and Selected Interaction Effects

For a linear regression model with an intercept, three fixed-effects predictors, and interaction effects between two of the predictors, such as

$y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + β_{3} x_{i 3} + β_{4} x_{i 1} x_{i 2} + ε_{i},$

specify the model formula using either of the following equivalent Wilkinson notation formulas:

'y ~ x1*x2 + x3'

'y ~ x1 + x2 + x3 + x1:x2'

Intercept, Three Predictors, and Lower-Order Interaction Effects Only

For a linear regression model with an intercept, three fixed-effects predictors, and pairwise interaction effects between all three predictors, but excluding an interaction effect between all three predictors simultaneously, such as

$y_{i} = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2} + β_{3} x_{i 3} + β_{4} x_{i 1} x_{i 2} + β_{5} x_{i 1} x_{i 3} + β_{6} x_{i 2} x_{i 3} + ε_{i},$

specify the model formula using Wilkinson notation as follows:

'y ~ x1*x2*x3 - x1:x2:x3'
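
The following sketch shows the effect of removing the three-way interaction by counting the terms in the full and reduced models. The synthetic data and the names mdlAll and mdlPairwise are assumptions made for illustration.

```matlab
% Sketch with synthetic data (hypothetical values, for illustration only).
rng(3)
X = randn(60,3);                                        % columns become x1, x2, x3
y = 1 + X*[0.5; -1; 2] + 0.3*X(:,1).*X(:,2) + 0.1*randn(60,1);

mdlAll      = fitlm(X,y,'y ~ x1*x2*x3');                % all main effects and interactions
mdlPairwise = fitlm(X,y,'y ~ x1*x2*x3 - x1:x2:x3');     % drop only the three-way term

% The reduced model has one coefficient fewer than the full model.
fprintf('Full model terms:     %d\n', mdlAll.NumCoefficients)
fprintf('Pairwise model terms: %d\n', mdlPairwise.NumCoefficients)
disp(mdlPairwise.CoefficientNames)
```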

Linear Mixed-Effects Model Examples

Use fitlme and fitlmematrix to fit linear mixed-effects models.

Random Effect Intercept, No Predictors

For a linear mixed-effects model that contains a random intercept but no predictor terms, such as

$y_{i m} = β_{0 m},$

where

$β_{0 m} = β_{00} + b_{0 m}, b_{0 m} \sim N (0, σ_{0}^{2})$

and g is the grouping variable with m levels, specify the model formula using Wilkinson notation as follows:

'y ~ (1 | g)'

Random Intercept and Fixed Slope for One Predictor

For a linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, such as

$y_{i m} = β_{0 m} + β_{1} x_{i m},$

where

$β_{0 m} = β_{00} + b_{0 m}, b_{0 m} \sim N (0, σ_{0}^{2})$

and g is the grouping variable with m levels, specify the model formula using Wilkinson notation as follows:

'y ~ x1 + (1 | g)'

Random Intercept and Random Slope for One Predictor

For a linear mixed-effects model that contains a fixed intercept, plus a random intercept and a random slope that have a possible correlation between them, such as

$y_{i m} = β_{0 m} + β_{1 m} x_{i m},$

where

$β_{0 m} = β_{00} + b_{0 m}$

$β_{1 m} = β_{10} + b_{1 m}$

$\begin{bmatrix} b_{0 m} \\ b_{1 m} \end{bmatrix} \sim N (0, σ^{2} D (θ))$

and D is a 2-by-2 symmetric and positive semidefinite covariance matrix, parameterized by a variance component vector θ, specify the model formula using Wilkinson notation as follows:

'y ~ x1 + (x1 | g)'

The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through fitlme when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the 'CovariancePattern' name-value pair argument in fitlme.
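
As a sketch of this option, the following code fits the same formula twice: once with the default covariance pattern and once with 'CovariancePattern' set to 'Diagonal', which treats the random intercept and random slope as uncorrelated. The synthetic data, group sizes, and variable names are hypothetical.

```matlab
% Sketch with synthetic grouped data (hypothetical values, for illustration only).
rng(4)
nGroups = 10;  nPer = 20;
idx = repelem((1:nGroups)',nPer);        % group index for each observation
g   = categorical(idx);                  % grouping variable
x1  = randn(nGroups*nPer,1);
b0  = 0.8*randn(nGroups,1);              % group-level intercept deviations
b1  = 0.5*randn(nGroups,1);              % group-level slope deviations
y   = (2 + b0(idx)) + (1 + b1(idx)).*x1 + 0.3*randn(nGroups*nPer,1);
tbl = table(y,x1,g);

% Default pattern: intercept and slope random effects may be correlated.
lmeCorr  = fitlme(tbl,'y ~ x1 + (x1 | g)');
% Diagonal pattern: intercept and slope random effects assumed uncorrelated.
lmeIndep = fitlme(tbl,'y ~ x1 + (x1 | g)','CovariancePattern','Diagonal');

disp(fixedEffects(lmeCorr))              % fixed intercept and slope estimates
disp(numel(randomEffects(lmeIndep)))     % one intercept and one slope per group
```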

Generalized Linear Model Examples

Use fitglm and stepwiseglm to fit generalized linear models.

In a generalized linear model, the response variable y has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear model requires three parts:

Distribution of the response variable
Link function
Linear predictor

The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function fitglm or stepwiseglm.

The linear predictor portion of the equation, which appears on the right side of the ~ symbol in the model specification formula, uses Wilkinson notation in the same way as for the linear model examples.

A generalized linear model relates a link function of the response, rather than the response itself, to the linear predictor. The output display for the model object reflects this.

Intercept and Two Predictors

For a generalized linear regression model with an intercept and two predictors, such as

$\log (y_{i}) = β_{0} + β_{1} x_{i 1} + β_{2} x_{i 2},$

specify the model formula using Wilkinson notation as follows:

'y ~ x1 + x2'
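
A minimal sketch of this case, assuming Poisson-distributed counts and synthetic data: the distribution is passed as a name-value argument, and the formula supplies only the linear predictor. The variable names are hypothetical; the log link is the default for the Poisson distribution in fitglm.

```matlab
% Sketch with synthetic count data (hypothetical values, for illustration only).
rng(5)
X  = randn(100,2);                                   % columns become x1 and x2
mu = exp(0.5 + 0.3*X(:,1) - 0.2*X(:,2));             % mean under a log link
y  = poissrnd(mu);                                   % Poisson-distributed response

% The formula gives the linear predictor; the distribution is a name-value argument.
mdl = fitglm(X,y,'y ~ x1 + x2','Distribution','poisson');

disp(mdl)                                            % display shows the link applied to y
```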

Generalized Linear Mixed-Effects Model Examples

Use fitglme to fit generalized linear mixed-effects models.

In a generalized linear mixed-effects model, the response variable y has a distribution other than normal, but you can represent the model as an equation that is linear in the regression coefficients. Specifying a generalized linear mixed-effects model requires three parts:

Distribution of the response variable
Link function
Linear predictor

The distribution of the response variable and the link function are specified using name-value pair arguments in the fit function fitglme.

A generalized linear mixed-effects model relates a link function of the response, rather than the response itself, to the linear predictor. The output display for the model object reflects this.

The pattern of the random effects covariance matrix is determined by the model fitting function. To specify the covariance matrix pattern, use the name-value pairs available through fitglme when fitting the model. For example, you can specify the assumption that the random intercept and random slope are independent of one another using the 'CovariancePattern' name-value pair argument in fitglme.

Random Intercept and Fixed Slope for One Predictor

For a generalized linear mixed-effects model that contains a fixed intercept, random intercept, and fixed slope for the continuous predictor variable, where the response can be modeled using a Poisson distribution, such as

$\log (y_{i m}) = β_{0} + β_{1} x_{i m} + b_{m},$

where

$b_{m} \sim N (0, σ_{b}^{2})$

and g is the grouping variable with m levels, specify the model formula using Wilkinson notation as follows:

'y ~ x1 + (1 | g)'
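
The following sketch fits this formula with fitglme, passing the Poisson distribution and log link as name-value arguments. The synthetic data, group sizes, and variable names are assumptions made for illustration.

```matlab
% Sketch with synthetic grouped count data (hypothetical values, for illustration only).
rng(6)
nGroups = 12;  nPer = 25;
idx = repelem((1:nGroups)',nPer);        % group index for each observation
g   = categorical(idx);                  % grouping variable
x1  = randn(nGroups*nPer,1);
b   = 0.4*randn(nGroups,1);              % group-level random intercepts
y   = poissrnd(exp(0.5 + 0.3*x1 + b(idx)));
tbl = table(y,x1,g);

% Fixed intercept and slope, plus a random intercept for each level of g.
glme = fitglme(tbl,'y ~ x1 + (1 | g)','Distribution','Poisson','Link','log');

disp(fixedEffects(glme))                 % estimates of the fixed intercept and slope
disp(numel(randomEffects(glme)))         % one random intercept per group
```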

Repeated Measures Model Examples

Use fitrm to fit repeated measures models.

One Predictor

For a repeated measures model with five response measurements and one predictor variable, specify the model formula using Wilkinson notation as follows:

'y1-y5 ~ x1'

Three Predictors and an Interaction Term

For a repeated measures model with five response measurements and three predictor variables, plus an interaction between two of the predictor variables, specify the model formula using Wilkinson notation as follows:

'y1-y5 ~ x1*x2 + x3'
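
As a sketch of both repeated measures formulas, the following code builds a table whose first five variables are the responses y1 through y5 and fits each model with fitrm. The synthetic data and the names rm1 and rm2 are hypothetical. The range y1-y5 refers to the table variables from y1 through y5, so those variables are stored next to each other here.

```matlab
% Sketch with synthetic data (hypothetical values, for illustration only).
rng(7)
n  = 30;
x1 = randn(n,1);  x2 = randn(n,1);  x3 = randn(n,1);
signal = 1 + 0.5*x1 + 0.2*x1.*x2;                    % shared mean for all five measurements
Y  = repmat(signal,1,5) + randn(n,5);                % n-by-5 matrix of repeated measures
tbl = array2table([Y x1 x2 x3], ...
    'VariableNames',{'y1','y2','y3','y4','y5','x1','x2','x3'});

rm1 = fitrm(tbl,'y1-y5 ~ x1');                       % one predictor
rm2 = fitrm(tbl,'y1-y5 ~ x1*x2 + x3');               % three predictors and one interaction

% Coefficients has one row per model term and one column per response.
disp(rm1.Coefficients)
disp(rm2.Coefficients)
```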

References

[1] Wilkinson, G. N., and C. E. Rogers. "Symbolic description of factorial models for analysis of variance." J. Royal Statistical Society 22, pp. 392–399, 1973.