Fit a hierarchical multinomial regression model.

Navigate to the folder containing sample data.

cd(matlabroot)
cd('help/toolbox/stats/examples')

Load the sample data.

load smoking

The data set `smoking` contains five variables:
sex, age, weight, and systolic and diastolic blood pressure. Sex is
a binary variable where 1 indicates female patients, and 0 indicates
male patients.

Define the response variable.

Y = categorical(smoking.Smoker);

The data in `Smoker` has four categories:

0: Nonsmoker, 0 cigarettes a day

1: Smoker, 1–5 cigarettes a day

2: Smoker, 6–10 cigarettes a day

3: Smoker, 11 or more cigarettes a day

Define the predictor variables.

X = [smoking.Sex smoking.Age smoking.Weight...
smoking.SystolicBP smoking.DiastolicBP];

Fit a hierarchical multinomial model.

[B,dev,stats] = mnrfit(X,Y,'model','hierarchical');
B

B =
43.8148 5.9571 44.0712
1.8709 -0.0230 0.0662
0.0188 0.0625 0.1335
0.0046 -0.0072 -0.0130
-0.2170 0.0416 -0.0324
-0.2273 -0.1449 -0.4824

The first column of `B` includes the intercept
and the coefficient estimates for the model of the relative risk of
being a nonsmoker versus a smoker. The second column includes the
parameter estimates for modeling the log odds of smoking 1–5
cigarettes a day versus more than five cigarettes a day given that
a person is a smoker. Finally, the third column includes the parameter
estimates for modeling the log odds of a person smoking 6–10
cigarettes a day versus more than 10 cigarettes a day given he/she
smokes more than 5 cigarettes a day.

The coefficients differ across categories. You can specify this
using the `'interactions','on'` name-value pair argument,
which is the default for hierarchical models. So, the model in this
example is

$$\begin{array}{l}\mathrm{ln}\left(\frac{P\left(y=0\right)}{P\left(y>0\right)}\right)=43.8148+1.8709{X}_{S}+0.0188{X}_{A}+0.0046{X}_{W}-0.2170{X}_{SBP}-0.2273{X}_{DBP}\\ \\ \mathrm{ln}\left(\frac{P\left(1\le y\le 5\right)}{P\left(y>5\right)}\right)=5.9571-0.0230{X}_{S}+0.0625{X}_{A}+0.0072{X}_{W}+0.0416{X}_{SBP}-0.1449{X}_{DBP}\\ \\ \mathrm{ln}\left(\frac{P\left(6\le y\le 10\right)}{P\left(y>10\right)}\right)=44.0712+0.0662{X}_{S}+0.1335{X}_{A}-0.0130{X}_{W}-0.0324{X}_{SBP}-0.4824{X}_{DBP}\end{array}$$

For example, the coefficient estimate of 1.8709 indicates that
the likelihood of being a smoker versus a nonsmoker increases by exp(1.8709)
= 6.49 times as the gender changes from female to male given everything
else held constant.

Assess the statistical significance of the terms.

stats.p

ans =
0.0000 0.5363 0.2149
0.3549 0.9912 0.9835
0.6850 0.2676 0.2313
0.9032 0.8523 0.8514
0.0009 0.5187 0.8165
0.0004 0.0483 0.0545

Sex, age, or weight don't appear significant on any level.
The *p*-values of 0.0009 and 0.0004 indicate that
both types of blood pressure are significant on the relative risk
of a person being a smoker versus a nonsmoker. The *p*-value
of 0.0483 shows that only diastolic blood pressure is significant
on the odds of a person smoking 0–5 cigarettes a day versus
more than 5 cigarettes a day. Similarly, the *p*-value
of 0.0545 indicates that diastolic blood pressure is significant on
the odds of a person smoking 6–10 cigarettes a day versus more
than 10 cigarettes a day.

Check if any nonsignificant factors are correlated to
each other. Draw a scatterplot of age versus weight grouped by sex.

figure()
gscatter(smoking.Age,smoking.Weight,smoking.Sex)
legend('Male','Female')
xlabel('Age')
ylabel('Weight')

The range of weight of an individual seems to differ according
to gender. Age does not seem to have any obvious correlation with
sex or weight. Age is insignificant and weight seems to be correlated
with sex, so you can eliminate both and reconstruct the model.

Eliminate age and weight from the model and fit a hierarchical
model with sex, systolic blood pressure, and diastolic blood pressure
as the predictor variables.

X = double([smoking.Sex smoking.SystolicBP...
smoking.DiastolicBP]);
[B,dev,stats] = mnrfit(X,Y,'model','hierarchical');
B

B =
44.8456 5.3230 25.0248
1.6045 0.2330 0.4982
-0.2161 0.0497 0.0179
-0.2222 -0.1358 -0.3092

Here, a coefficient estimate of 1.6045 indicates that the likelihood
of being a nonsmoker versus a smoker increases by exp(1.6045) = 4.97
times as sex changes from male to female. A unit increase in the systolic
blood pressure indicates an exp(–.2161) = 0.8056 decrease in
the likelihood of being a nonsmoker versus a smoker. Similarly, a
unit increase in the diastolic blood pressure indicates an exp(–.2222)
= 0.8007 decrease in the relative rate of being a nonsmoker versus
being a smoker.

Assess the statistical significance of the terms.

stats.p

ans =
0.0000 0.4715 0.2325
0.0210 0.7488 0.6362
0.0010 0.4107 0.8899
0.0003 0.0483 0.0718

The *p*-values of 0.0210, 0.0010, and 0.0003
indicate that the terms sex and both types of blood pressure are significant
on the relative risk of a person being a nonsmoker versus a smoker,
given the other terms in the model. Based on the *p*-value
of 0.0483, diastolic blood pressure appears significant on the relative
risk of a person smoking 1–5 cigarettes versus more than 5
cigarettes a day, given that this person is a smoker. Because none
of the *p*-values on the third column are less than
0.05, you can say that none of the variables are statistically significant
on the relative risk of a person smoking from 6–10 cigarettes
versus more than 10 cigarettes, given that this person smokes more
than 5 cigarettes a day.