Prepare Data for Linear Mixed-Effects Models

Tables and Dataset Arrays

To fit a linear-mixed effects model, you must store your data in a table or dataset array. In your table or dataset array, you must have a column for each variable including the response variable. More specifically, the table or dataset array, say tbl, must contain the following:

A response variable y
Predictive variables X_jwhich can be continuous or grouping variables
Grouping variables g₁, g₂, ..., g_R,

where the grouping variables in X_j and g_r can be categorical, logical, a character array, a string array, or a cell array of character vectors, r = 1, 2, ..., R.

You must organize your data so that each row represents an observation. And each row should contain the value of variables and the levels of grouping variables corresponding to that observation. For example, if you have data from an experiment with four treatment options, on five different types of individuals chosen randomly from a population of individuals (blocks), the table or dataset array must look like this.

Block	Treatment	Response
1	1	y11
1	2	y12
1	3	y13
1	4	y14
...	...	...
5	1	y51
5	2	y52
5	3	y53
5	4	y54

Now, consider a split-plot experiment, where the effect of four different types of fertilizers on the yield of tomato plants is studied. The soil where the tomato plants are planted is divided into three blocks based on the soil type: sandy, silty, and loamy. Each block is divided into five plots, where five types of tomato plants, (cherry, heirloom, grape, vine, and plum) are randomly assigned to these plots. Then, the tomato plants in the plots are divided into subplots, where each subplot is treated by one of the four fertilizers. The data from this experiment looks like:

Soil	Tomato	Fertilizer	Yield
'Sandy'	'Plum'	1	104
'Sandy'	'Plum'	2	136
'Sandy'	'Plum'	3	158
'Sandy'	'Plum'	4	174
'Sandy'	'Cherry'	1	57
'Sandy'	'Cherry'	2	86
...	...	...	...
'Sandy'	'Vine'	3	99
'Sandy'	'Vine'	4	117
'Silty'	'Plum'	1	120
'Silty'	'Plum'	2	115
...	...	...	...
'Loamy'	'Vine'	3	111
'Loamy'	'Vine'	4	105

You must specify the model you want to fit using the formula input argument to fitlme.

In general, a formula for model specification is a character vector or string scalar of the form 'y ~ terms'. For linear mixed-effects models, this formula is in the form 'y ~ fixed + (random1|grouping1) + ... + (randomR|groupingR)', where fixed contains the fixed-effects terms and random1, ..., randomR contain the random-effects terms. For example, for the previous fertilizer experiment, consider the following mixed-effects model

$y_{i m j k} = β_{0} + \sum_{m = 2}^{4} β_{1 m} I {[F]}_{i m} + \sum_{j = 2}^{5} β_{2 j} I {[T]}_{i j} + b_{0 k} S_{k} + b_{0 j k} {(S * T)}_{j k} + ε_{i m j k},$

where i = 1, 2, ..., 60, the index m corresponds to the fertilizer types, j corresponds to the tomato types, and k = 1, 2, 3 corresponds to the blocks (soil). S_k represents the kth soil type, and I[F]_im is the dummy variable representing level m of the fertilizer. Similarly, I[T]_ij is the dummy variable representing the level j of the tomato type.

You can fit this model using the formula 'Yield ~ 1 + Fertilizer + Tomato + (1|Soil)+(1|Soil:Tomato)'.

For detailed information on how to specify your model using formula, see Relationship Between Formula and Design Matrices.

Design Matrices

If you cannot easily describe your model using a formula, you can create design matrices to define the fixed and random effects, and fit the model using fitlmematrix(X,y,Z,G). You must create your design matrices as follows.

Fixed-effects and random-effects design matrices X and Z:

Enter a column of 1s for the intercept using ones(n,1), where n is the total number of observations.
If X1 is a continuous variable, then enter X1 as it is in a separate column.
If X1 is a categorical variable with m levels, then there must be m – 1 dummy variables for m – 1 levels of X1 in X.
For example, consider an experiment where you want to study the impact of quality of raw materials from four different providers on the productivity of a production line. If you fit a linear mixed-effects model with intercept and provider as the fixed-effects terms, intercept is the random-effects term, and you use reference contrasts coding, then you must construct your fixed- and random-effects design matrices as follows.
```
D = dummyvar(provider); % Create dummy variables
X = [ones(n,1) D(:,2) D(:,3) D(:,4)];
Z = [ones(n,1)];
```
Because reference contrast coding uses the first provider as the reference, and the model has an intercept, you must use the dummy variables for only the last three providers.
If there is an interaction term of predictor variables X1 and X2, then you must enter a column that you form by elementwise product of the vectors X1 and X2.
For example, if you want to fit a model, where there is an intercept, a continuous treatment factor, a continuous time factor, and their interaction as the fixed-effects in a longitudinal study, and time is the random-effects term, then your fixed- and random-effects design matrices should look like
```
X = [ones(n,1),treatment,time,treatment.*time];
y = response;
Z = [time];
```

Grouping variables G:

There is one column for each grouping variable and a column of elementwise product of the grouping variables in case of a nesting.

For example, if you want to group plots (plot) within blocks (block), then you must add a column of elementwise product of plot by block. More specifically, if you want to fit a model where there is intercept and a continuous treatment factor as the fixed-effects in a split-block experiment, and the intercept and treatment are grouped by the plots nested within blocks, then the design matrices should look like this.

X = [ones(n,1),treatment];
y = response;
Z = [ones(n,1),treatment];
G = [block.*plot];

Suppose in the earlier quality of raw materials example, the raw materials arrive in bulks, and the bulks are nested within providers. If you want to fit a linear mixed-effects model, where intercept is grouped by the bulks within providers, then your design matrices should look like this.

D = dummyvar(provider);
X = [ones(n,1) D(:,2) D(:,3) D(:,4)];
y = response;
Z = ones(n,1);
G = [provider.*bulks];

In the earlier longitudinal study example, if you want to add random effects for intercept and time grouped by subjects that participated in the study, then your design matrices should look like

X = [ones(n,1),treatment,time, treatment.*time];
y = response;
Z = [ones(n,1),time];
G = subject;

Relation of Matrix Form to Tables and Dataset Arrays

fitlme(tbl,formula) and fitlmematrix(X,y,Z,G) are equivalent in functionality, such that

y is the n-by-1 response vector.
X is an n-by-p fixed-effects design matrix. fitlme constructs this from the expression fixed in formula.
Z is an R-by-1 cell array with Z{r} being an n-by-q(r) random-effects design matrix constructed from the rth expression in random in formula, r = 1, 2, ..., R.
G is an R-by-1 cell array with G{r} being an n-by-1 grouping variable, g_r, in formula with M(r) levels or groups.

For example, if tbl is a table or dataset array containing the response variable y, the continuous variables X1 and X2, and the grouping variable g, then to fit a linear mixed-effects model that corresponds to the formula expression 'y ~ X1+ X2+ (X1*X2|g)' using fitlmematrix(X,y,Z,G) the input arguments must correspond to the following:

y = tbl.y
X = [ones(n,1), tbl.X1, tbl.X2]
Z = [ones(n,1), tbl.X1, tbl.X2, tbl.X1.*tbl.X2]
G = tbl.g