Dummy variable coding in mixed models (LME)

Question

Paul on 10 Dec 2018

4
Link

Direct link to this question

https://www.mathworks.com/matlabcentral/answers/434887-dummy-variable-coding-in-mixed-models-lme

Commented: Paul on 16 Feb 2024

Hi all,

I've been a little perplexed by the different ways to code dummy variables when fitting a linear mixed model (using fitlme). Specifically I have a model with two categorical fixed factors. I'd like to do contrasts between the different levels of each factor. Now several online sources tell me I should use 'effects' coding for this, but it isn't clear to me why this is, nor is it clear to me how I should code my contrast matrix when using 'effects' coding rather than reference coding.

For example, if I have a model with an intercept and one categorical fixed factor with three levels, such that:

T       = table(y,var1); % y is my response variable, var1 a categorical variable with three levels
formula = 'y ~ var1';
m1      = fitlme(T,formula,'DummyVarCoding', 'reference');
me1     = fitlme(T,formula,'DummyVarCoding', 'effects');

Now I'd like to test whether there's a difference between the first and second categories. In the case of the m1 (reference coding) and me1 (effects coding) I would do this as follows:

[p,F,df1,df2]     = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);

This gives me very different results even though in both cases multiplying each contrast matrix with the associated fixed effects matrix gives me the same value (according to the documentation for coefTest: "It tests the null hypothesis that H0: Hβ = 0, where β is the fixed-effects vector."

When I try both options with some simulated data, where the first two levels of the fixed factor differ significantly, I only find this significant difference when using reference coding. Any insight would be greatly appreciated.

Here's the code for some simulated data:

d1 = zeros(100,1)+randn(100,1);
d2 = ones(100,1)+randn(100,1);
d3 = ones(100,1).*2+randn(100,1);
d = [d1;d2;d3];
cov1 = [zeros(100,1);ones(100,1);ones(100,1).*2];
T = table(d,categorical(cov1),'VariableNames',{'d','cov1'});
f = 'd~cov1';
m1 = fitlme(T,f);
me1 = fitlme(T,f,'DummyVarCoding','effects');
[p,F,df1,df2]     = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Paul on 24 Jul 2020

2
Link

Direct link to this answer

https://www.mathworks.com/matlabcentral/answers/434887-dummy-variable-coding-in-mixed-models-lme#answer_470368

Edited: Paul on 24 Jul 2020

Hi! Yes I solved it.

The problem was that, in the example above, I calculated the H vector incorrectly. This is the vector [0 1 0] which is input to the coefTest() function and which tells the function which contrast to perform.

To compute correct H vectors we need to first look at the model coefficients. For the dummy coded model these are:

(intercept)

Cov_1

Cov_2

And for the effects coded model these are:

(intercept)

Cov_0

Cov_1

To compute the H vector for a contrast, you need to make a vector for each of the two conditions you want to contrast and subtract those vectors from each other. In this case we want to compare condition 0 with condition 1.

For the dummy coded model condition 0 is represented by only the intercept, because in dummy coding the intercept is set equal to the estimate of the first condition, in other words the intercept = Cov_0, so this vector = [1 0 0]. Condition 1 is represented by the intercept + the coefficient Cov_1, so the vector = [1 1 0]. The difference between these vectors: H= [0 -1 0]. You could flip the order of the subtraction around: H = [0 1 0] would be equivalent.

For the effects coded model the intercept does NOT represent condition 0. Instead, condition 0 is now represented by the intercept + the coefficient Cov_0, so this vector = [1 1 0]. The second condition is equal to the intercept + Cov_1, just like in the dummy coded model, but now this vector = [1 0 1]. The difference between these two is H = [0 1 -1] or [0 -1 1].

Using the new vectors both coefTest(m1, [0 1 0]) and coefTest(me1, [0 1 -1]) return the same output.

I personally prefer using dummy coding for doing contrasts, because I find it a bit easier to intuit the H vector, but in principle they are entirely equivalent. When using the anova() function though, you NEED to use the effects coded model to get accurate statistics.

However, you might wonder how to compute contrasts with the third condition, represented by Cov_2, when using effects coding. Cov_2 is not explicitly mentioned in the effects coded model, so how does this work? Well condition 2 is represented by the vector [1 -1 -1], so a contrast between condition 1 and 2 would give H = [1 0 1] - [1 -1 -1] = [0 1 2] when using effects coding. Whereas using dummy coding it would give H = [1 1 0] - [1 0 1] = [0 1 -1].

Another thing I found hard to figure out because the Matlab documentation doesn't discuss it, is how to compare groups of conditions, for instance when you have nested categories. I thought it might be helpful to mention here as well. For example imagine you want to test the efficacy of two diets on weight loss and you have data from both men and women. You would have four categories: weight loss with diet 1 and weight loss with diet 2 for both men and women. You want to study the interaction between gender and type of diet. Your model would be: 'Weight loss ~ DietType * Gender'.

Your dummy coded model would have the following coefficients:

(intercept) < which represents the category Diet_0 + Gender_0

Diet_1

Gender_1

Diet_1*Gender_1

If you want to see whether there's a difference between the first and second diet, the procedure is to add the vectors for each individual category within a group, and then subtract the two group vectors, as follows:

(v1 + v2) - (v3 + v4) = H

Here v1 represents Gender_0 on Diet_0 (just the intercept), v2 represents Gender_1 on Diet_0, v3 represents Gender_0 on Diet_1 and v4 represents Gender_1 on Diet_1. The calculation would be:

([1 0 0 0] + [1 0 1 0]) - ([1 1 0 0]+[1 1 1 1]) = [0 -2 0 -1] = H

If you want to compare efficacy of dieting between men and women, you would say H =

([1 0 0 0] + [1 1 0 0]) - ([1 0 1 0] + [1 1 1 1]) = [0 0 -2 -1]

4 Comments
Show 2 older commentsHide 2 older comments

Paul on 25 Jul 2020

Edited: Paul on 25 Jul 2020

Hi Xue,

Discovering LMEs with Matlab has helped me a lot in my work but it was a frustrating proces to figure out the details. I'm happy to help others avoid some of that frustration.

No, in the effects coded model Cov_2 is NOT the intercept. In the effects coded model the intercept is not equal to any of your groups/conditions, but rather lies somewhere in between them. In my example the estimate of Cov_2 in the effects coded model uses [1 -1 -1]. This should give the same value as using the vector [1 0 1] in the dummy coded model. This is why interpreting a dummy coded model is (for me at least) a bit more intuitive than interpreting an effects coded model. I also mention this in the paragraph above: "However, you might wonder..."
The function anova() only works properly with an effects coded model because it ASSUMES you are using an effects coded model as input.

I'm no expert on the exact mathematics, but this is related to one specific characteristic of effects coded models, namely that if you sum all the vectors related to all possible conditions they should produce all zeros, except at the intercept. So in the above example there are three conditions: Cov_0, Cov_1, Cov_2. In the effects coded model these are associated with the vectors [1 1 0], [1 0 1] and [1 -1 -1]. If you sum those all together, the result is [3 0 0]. All elements are 0 except for the position of the intercept (3). This is always the case for effects coded models no matter how large, and it is this specific characteristic of the effects coded model which anova() requires to work.

In the dummy coded model the three conditions are associated with [1 0 0], [1 1 0] and [1 0 1], which sum to [3 1 1]. So they do not sum to zero, obviously.

So important to understand is that the difference between a dummy coded model and an effects coded model is mostly in what the point of reference (intercept) is and how the other coefficients are related to the intercept. If you have a model with only two conditions, in a dummy coded model your intercept would be equal to condition 1 (which would be associated with vector [1 0]) and the only coefficient would be the difference between condition 1/intercept and condition 2 (which would be associated with vector [1 1]). In an effects coded model the intercept would be the halfway point between condition 1 and coindtion 2, and the only coefficient for the model would need to be added to the intercept to produce the estimate of condition 1 ([1 1]) and subtracted from the intercept to produce condition 2 ([1 -1]). Because that is the only difference, dummy coded and effects coded models produce identical fits, identical confidence intervals, identical log-likelihoods, identical Rsquared etc etc etc.

Let me know if you have any more questions.

LM_BU on 15 Feb 2024

Hi, @Paul, I am not sure how you have arrived at the conclusion that the intercept must be flagged as 1 for each level, i.e., why Cov_0 is [1 1 0].

Based on the MATLAB documentation of fitglme coefTest, where they also use effect coding, they have Supplier B and Supplier C. When you add them together to find the main effect of supplier as [0,0,0,0,1,0;0 0 0 0 0 1], the p value is in agreement with the anova results (i.e., anova(glme)).

If we do [1,0,0,0,1,0;1 0 0 0 0 1], where intercept is also added, the p value is wildly different. To be honest, I'm not sure what this contrast matrix is supposed to be testing.

Can you please clarify? Thanks!

Paul on 16 Feb 2024

Hi LM_BU,

I don't think I said that "Cov_0 is [1 1 0]". I did say that the first condition is represented by "intercept + Cov_0 is [1 1 0]". In effects coding, every condition contains the intercept; that's part of the definition of the effects coding and as far as I can tell this is also in line with the contents of the documentation page you linked.

I do not see the example you mention (of testing the main effect of supplier) ont he page you linked. Are we both looking at the docs for 2023b?

I do agree with you that when testing for the main effect of supplier, you should leave out the intercept, because you are interested in the effect of supplier, not of the intercept.

In my examples above, I am mostly talking about testing between conditions, which is different from testing main effects. When testing between conditions, you subtract the effects vectors from the two conditions. Since a condition in effects coding always includes the intercept, subtracting effects vectors of any two conditions will always produce an H vector with the intercept element == 0. Example: if you contrast condition 1 [1 0 0] with condition 2 [1 1 0], your H vector will be [1 0 0] - [1 1 0] = [0 -1 0].

Sign in to comment.