MATLAB Answers

Dummy variable coding in mixed models (LME)

52 views (last 30 days)
Paul
Paul on 10 Dec 2018
Edited: Paul on 25 Jul 2020 at 10:44
Hi all,
I've been a little perplexed by the different ways to code dummy variables when fitting a linear mixed model (using fitlme). Specifically I have a model with two categorical fixed factors. I'd like to do contrasts between the different levels of each factor. Now several online sources tell me I should use 'effects' coding for this, but it isn't clear to me why this is, nor is it clear to me how I should code my contrast matrix when using 'effects' coding rather than reference coding.
For example, if I have a model with an intercept and one categorical fixed factor with three levels, such that:
T = table(y,var1); % y is my response variable, var1 a categorical variable with three levels
formula = 'y ~ var1';
m1 = fitlme(T,formula,'DummyVarCoding', 'reference');
me1 = fitlme(T,formula,'DummyVarCoding', 'effects');
Now I'd like to test whether there's a difference between the first and second categories. In the case of the m1 (reference coding) and me1 (effects coding) I would do this as follows:
[p,F,df1,df2] = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);
This gives me very different results even though in both cases multiplying each contrast matrix with the associated fixed effects matrix gives me the same value (according to the documentation for coefTest: "It tests the null hypothesis that H0: Hβ = 0, where β is the fixed-effects vector."
When I try both options with some simulated data, where the first two levels of the fixed factor differ significantly, I only find this significant difference when using reference coding. Any insight would be greatly appreciated.
Here's the code for some simulated data:
d1 = zeros(100,1)+randn(100,1);
d2 = ones(100,1)+randn(100,1);
d3 = ones(100,1).*2+randn(100,1);
d = [d1;d2;d3];
cov1 = [zeros(100,1);ones(100,1);ones(100,1).*2];
T = table(d,categorical(cov1),'VariableNames',{'d','cov1'});
f = 'd~cov1';
m1 = fitlme(T,f);
me1 = fitlme(T,f,'DummyVarCoding','effects');
[p,F,df1,df2] = coefTest(m1,[0 1 0]); % coefTest(lme,H)
[pe,Fe,df1e,df2e] = coefTest(me1,[1 1 0]);

  0 Comments

Sign in to comment.

Accepted Answer

Paul
Paul on 24 Jul 2020 at 10:40
Edited: Paul on 24 Jul 2020 at 10:44
Hi! Yes I solved it.
The problem was that, in the example above, I calculated the H vector incorrectly. This is the vector [0 1 0] which is input to the coefTest() function and which tells the function which contrast to perform.
To compute correct H vectors we need to first look at the model coefficients. For the dummy coded model these are:
(intercept)
Cov_1
Cov_2
And for the effects coded model these are:
(intercept)
Cov_0
Cov_1
To compute the H vector for a contrast, you need to make a vector for each of the two conditions you want to contrast and subtract those vectors from each other. In this case we want to compare condition 0 with condition 1.
For the dummy coded model condition 0 is represented by only the intercept, because in dummy coding the intercept is set equal to the estimate of the first condition, in other words the intercept = Cov_0, so this vector = [1 0 0]. Condition 1 is represented by the intercept + the coefficient Cov_1, so the vector = [1 1 0]. The difference between these vectors: H= [0 -1 0]. You could flip the order of the subtraction around: H = [0 1 0] would be equivalent.
For the effects coded model the intercept does NOT represent condition 0. Instead, condition 0 is now represented by the intercept + the coefficient Cov_0, so this vector = [1 1 0]. The second condition is equal to the intercept + Cov_1, just like in the dummy coded model, but now this vector = [1 0 1]. The difference between these two is H = [0 1 -1] or [0 -1 1].
Using the new vectors both coefTest(m1, [0 1 0]) and coefTest(me1, [0 1 -1]) return the same output.
I personally prefer using dummy coding for doing contrasts, because I find it a bit easier to intuit the H vector, but in principle they are entirely equivalent. When using the anova() function though, you NEED to use the effects coded model to get accurate statistics.
However, you might wonder how to compute contrasts with the third condition, represented by Cov_2, when using effects coding. Cov_2 is not explicitly mentioned in the effects coded model, so how does this work? Well condition 2 is represented by the vector [1 -1 -1], so a contrast between condition 1 and 2 would give H = [1 0 1] - [1 -1 -1] = [0 1 2] when using effects coding. Whereas using dummy coding it would give H = [1 1 0] - [1 0 1] = [0 1 -1].
Another thing I found hard to figure out because the Matlab documentation doesn't discuss it, is how to compare groups of conditions, for instance when you have nested categories. I thought it might be helpful to mention here as well. For example imagine you want to test the efficacy of two diets on weight loss and you have data from both men and women. You would have four categories: weight loss with diet 1 and weight loss with diet 2 for both men and women. You want to study the interaction between gender and type of diet. Your model would be: 'Weight loss ~ DietType * Gender'.
Your dummy coded model would have the following coefficients:
(intercept) < which represents the category Diet_0 + Gender_0
Diet_1
Gender_1
Diet_1*Gender_1
If you want to see whether there's a difference between the first and second diet, the procedure is to add the vectors for each individual category within a group, and then subtract the two group vectors, as follows:
(v1 + v2) - (v3 + v4) = H
Here v1 represents Gender_0 on Diet_0 (just the intercept), v2 represents Gender_1 on Diet_0, v3 represents Gender_0 on Diet_1 and v4 represents Gender_1 on Diet_1. The calculation would be:
([1 0 0 0] + [1 0 1 0]) - ([1 1 0 0]+[1 1 1 1]) = [0 -2 0 -1] = H
If you want to compare efficacy of dieting between men and women, you would say H =
([1 0 0 0] + [1 1 0 0]) - ([1 0 1 0] + [1 1 1 1]) = [0 0 -2 -1]

  2 Comments

Xue Zhang
Xue Zhang on 25 Jul 2020 at 1:37
Hi Paul,
Thanks for your detailed response from which I have benefited a lot. I have two questions, both are related to the effects coding:
  1. From my understanding, the intercept in effects model is the Cov_2, which is similar to the Cov_0 in the dummy model. Both (Cov_2 and Cov_1) are omitted and treated as reference respectively. So the condition2 is [1 0 0] instead of [1 -1 -1] in effects model?
  2. Why would the effects coded model provide more accurate statistics from ANOVA? Again I think the difference between reference and effects coded model is the choice of baseline condition, which is Cov_0 in reference and Cov_2 in effects model.
My problem might be that I misunderstood the effects model.
I look forward to your replies! Thank you so much!
Paul
Paul on 25 Jul 2020 at 10:39
Hi Xue,
Discovering LMEs with Matlab has helped me a lot in my work but it was a frustrating proces to figure out the details. I'm happy to help others avoid some of that frustration.
  1. No, in the effects coded model Cov_2 is NOT the intercept. In the effects coded model the intercept is not equal to any of your groups/conditions, but rather lies somewhere in between them. In my example the estimate of Cov_2 in the effects coded model uses [1 -1 -1]. This should give the same value as using the vector [1 0 1] in the dummy coded model. This is why interpreting a dummy coded model is (for me at least) a bit more intuitive than interpreting an effects coded model. I also mention this in the paragraph above: "However, you might wonder..."
  2. The function anova() only works properly with an effects coded model because it ASSUMES you are using an effects coded model as input.
I'm no expert on the exact mathematics, but this is related to one specific characteristic of effects coded models, namely that if you sum all the vectors related to all possible conditions they should produce all zeros, except at the intercept. So in the above example there are three conditions: Cov_0, Cov_1, Cov_2. In the effects coded model these are associated with the vectors [1 1 0], [1 0 1] and [1 -1 -1]. If you sum those all together, the result is [3 0 0]. All elements are 0 except for the position of the intercept (3). This is always the case for effects coded models no matter how large, and it is this specific characteristic of the effects coded model which anova() requires to work.
In the dummy coded model the three conditions are associated with [1 0 0], [1 1 0] and [1 0 1], which sum to [3 1 1]. So they do not sum to zero, obviously.
So important to understand is that the difference between a dummy coded model and an effects coded model is mostly in what the point of reference (intercept) is and how the other coefficients are related to the intercept. If you have a model with only two conditions, in a dummy coded model your intercept would be equal to condition 1 (which would be associated with vector [1 0]) and the only coefficient would be the difference between condition 1/intercept and condition 2 (which would be associated with vector [1 1]). In an effects coded model the intercept would be the halfway point between condition 1 and coindtion 2, and the only coefficient for the model would need to be added to the intercept to produce the estimate of condition 1 ([1 1]) and subtracted from the intercept to produce condition 2 ([1 -1]). Because that is the only difference, dummy coded and effects coded models produce identical fits, identical confidence intervals, identical log-likelihoods, identical Rsquared etc etc etc.
Let me know if you have any more questions.

Sign in to comment.

More Answers (1)

Xue Zhang
Xue Zhang on 24 Jul 2020 at 8:43
Same question here, have you figured out why? Thanks!

  1 Comment

Paul
Paul on 24 Jul 2020 at 16:10
Yes, see my above comment.

Sign in to comment.