How do I do Regression for multiple subjects

Hi
This a stats question for regression analysis
I have data for 19 subjects who did 8 blocks of a task.
I want to see if data increase "significantly" across blocks.
Which option is correct (statistically speaking)
They give different results!!
Option 1: Fit every subject and then t-test on the coefficient of rate of change (slope)
coeff = [];
for s=1:19
fit = fitlm(1:8, data(s,:)) ;
coeff(s,1) = fit.Coefficients.Estimate(2) ;
end
[h,p,ci,stats] = ttest(coeff)
p = 0.1030
Option 2: Average across all subject and then run fitlm (on the group average) and use the stats in the fit LinearModel
M = mean(data) ;
fit = fitlm(1:8, M) ;
fit.Coefficients.pValue(2) = 0.155
thanks
pb

Answers (2)

dpb
dpb about 1 hour ago
Moved: dpb about 1 hour ago
Strictly speaking, you're supposed to know the test for hypothesis before designing and running the experiment.
The Q? here is what is the actual null hypothesis to test? If the hypotheis is one that there is a difference in difficulty between tasks, then the better test would be Friedman...if the question is whether there's a trend in difficulty versus test, that's different. Then it's a Q? of whether it's for the population as a whole or for each individual.
Without additional background on the test and objectives, I'd suggest Friedman.
data=readmatrix('data.txt');
stackedplot([data mean(data,2)])
friedman(data);
It is pretty low for significance, but you have no replications. As the bottom plot shows, there does appear to be a slight curvature in the average versus block although it would be difficult to show significance I suspect (I didn't try).
If the Friedman test were to turn out significant there are then post-hoc tests that can be utilized to identify which specifically differ.
I considered using friedman however there are no actual reeititions, consideering that there are 19 subjects and 8 tasks. However, the t-test is inappropriate, since the data appear not to be normally distributed. If none of the task scores can be zero or negative, then the lognormal distibution may be more appropriate.
A1 = readmatrix('data.txt')
A1 = 19×8
6.9011 10.7700 13.8070 16.1510 17.0180 20.1360 22.2640 13.6310 17.1100 16.0630 20.2080 12.0660 14.0380 28.7000 20.7400 17.3030 8.0825 13.1410 12.7320 25.8700 7.0998 6.7292 11.2540 7.6235 5.4173 8.9503 4.2487 1.8015 1.7145 0.3866 2.1199 9.4284 10.9450 16.4700 16.8660 13.6870 22.1510 8.3364 20.4840 26.0050 5.6463 2.2456 2.2951 8.6048 3.3609 1.6777 4.6868 2.2203 4.5081 6.7629 4.8043 7.7382 2.7232 5.6407 11.4580 4.4322 3.7913 3.2618 11.2600 1.4352 2.0173 7.4965 13.1190 5.4684 5.5242 0.4020 5.3666 2.5546 13.5970 7.7196 6.3798 0.3766 4.1787 9.2256 11.7950 6.0266 5.7861 7.8338 2.6496 2.6319 4.0627 3.1675 5.1919 8.8082 5.1689 7.8048 9.7785 13.1400 2.7955 10.9590 15.6410 3.8313 10.4450 19.3890 15.3080 8.5590 3.4822 4.1784 2.4938 3.5147 3.8841 4.4561 2.3881 4.7253 5.0303 5.4010 6.4618 3.4920 8.4972 3.8040 10.9280 5.8896 9.7166 11.0600 8.3587 6.1232 17.4880 14.1780 13.3290 10.9400
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
figure
plot((1:size(A1,2)), A1)
grid
xlabel('Task')
ylabel('Sobject Score')
legend(compose('Subject #%2d', 1:size(A1,1)), Location='northoutside', NumColumns=4)
xlim('padded')
figure
boxchart(A1, Notch='on')
xlabel('Task')
ylabel('Sobject Score')
figure
boxplot(A1, Notch='on')
xlabel('Task')
ylabel('Sobject Score')
figure
tiledlayout(4,5)
for k = 1:size(A1,1)
nexttile
histfit(A1(k,:), 5, 'lognormal')
title(sprintf('Subject #%2d',k))
end
sgtitle('Lognormal Fit')
figure
tiledlayout(4,5)
for k = 1:size(A1,1)
nexttile
histfit(A1(k,:), 5, 'normal')
title(sprintf('Subject #%2d',k))
end
sgtitle('Normal Fit')
tmean = mean(A1) % Task Means
tmean = 1×8
6.9375 9.0530 10.4197 8.6694 9.2670 10.0667 10.6752 9.0549
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
tmed = median(A1) % Task Medians
tmed = 1×8
5.5242 9.0569 9.7983 6.1133 7.0998 7.7245 10.9280 8.5590
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
Looking at the median values (also displayed in the boxchart and boxplot figures), there does not appear to be any specific trend across tasks.
I am not certain what to suggest, however there does not appear to be anything significant across tasks in these results.
.

6 Comments

Pat
Pat about 18 hours ago
Moved: dpb 4 minutes ago
Hi Star Strider and dpb
The 8 columns are repetitions, it's each subject data in different blocks of a task. So they are repeated measures in fact
i know I ould use one-way ANOVA with repeated mesures, but a significant main effect of Blocks woudn't tell me much, i'd have to run multiple t-test amonts the 8 blocks.
pb
dpb
dpb about 16 hours ago
Moved: dpb 4 minutes ago
You have 19 subjects each of whom did eight different tasks once. That's not replication unless some of the eight tasks are the same task presented again?
I don't think regression for slope is meaningful here; the task numbers aren't a suitable predictor variable.
I'm not certain what your oriignal design or intent is.
There doesn't appear to be any specific trend in the median values for each task block.
A1 = readmatrix('data.txt');
tmed = median(A1)
tmed = 1×8
5.5242 9.0569 9.7983 6.1133 7.0998 7.7245 10.9280 8.5590
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
mdl = fitlm(1:numel(tmed), tmed)
mdl =
Linear regression model: y ~ 1 + x1 Estimated Coefficients: Estimate SE tStat pValue ________ _______ ______ _________ (Intercept) 6.7417 1.4191 4.7507 0.0031566 x1 0.30195 0.28103 1.0745 0.32391 Number of observations: 8, Error degrees of freedom: 6 Root Mean Squared Error: 1.82 R-squared: 0.161, Adjusted R-Squared: 0.0216 F-statistic vs. constant model: 1.15, p-value = 0.324
figure
plot(mdl)
grid
.
Pat
Pat about 14 hours ago
Moved: dpb 3 minutes ago
like I said it's repeated measures, 8 blocks during the same task done one after the other; same day in the same experimental session...task lasts about 20 min; it's different for each subj since it's a fatiguing task...people fatigued at different rate...
you're right about the unimpressive increase across blocks. But for other parts of the data in the same dataset (or other data I have) there is a stronger increase....it does not change the question I asked !!!
But that was not the question!!
Which method is more statistically sound:
regression on each subject and then t-test on the slopes (beta weight of the regression) OR average of the group and then regression on that mean.
-pb
The null hypothesis is yet to be clearly stated. If the "block" is considered an ordinal sequence number and the hypothesis is that there is an overall increase in duration with time/repitition, then the question is whether the conclusion is whether the change is for the population as a whole or for any specific individual. Presuming the population, the answer is obvious, use a measure of the durations of the population. Given @Star Strider's observation the observations are not normally distributed, the median may be a better choice or applying a Box transformation first.
If one were to do each individually, one has the isse of deriving a suitable correction for the problem of significance inflation for multiple comparisons.
I though I explained very clearly that I just want to know which of the option is more statistically sound!!
Forget the data I sent, I thought it would make the question clearer, but it seems to trouble you, so forget them...I have plenty of others data to look at
*** Option 1 or 2 ***
it's more of a theorotical question I guess
the null hypothesis is: no Slope change across the blocks
it means nothing to have block 3, for example, being larger/smaller than the others blocks. I need to see if there is a continuous trend across the blocks
-pb

Sign in to comment.

Asked:

Pat
about 22 hours ago

Commented:

Pat
9 minutes ago

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!