fitlm warns of singular design matrix when run in parfor loop, but not in a serial for loop (same data)
I wanted to estimate the distributions of some linear regression beta values using bootstrap resampling. To save time, I ran the fitlm beta estimation of the bootstrap samples in a parfor loop. To do so, I did the following:
- Set random seed for reproducibility.
- Generated bootstrap replicate indices (indices of the original data sampled with replacement).
- Passed the original data table and a bootstrap replicate's indices to a worker using parfor.
- Constructed the bootstrapped data on the worker from the original data table and the bootstrap indices.
- Ran fitlm on the worker and saved the beta values from the linear regression.
When I run this code, I get the following warning: "Warning: Regression design matrix is rank deficient to within machine precision." This warning appears for many (but not all) of my bootstrap replicates.
However, when I run the above process serially, I do not get the rank-deficiency warning. The bootstrapped data should be identical between the serial and parallel cases because I set the random seed before generating the bootstrap replicates; I simply switch the parfor to a for. I am not suppressing warnings in either case.
For what it is worth, my data table contains both continuous and categorical variables. When I limit my model to continuous variables, I do not get the rank-deficiency warning when running in parallel. The warning appears only when I include categorical variables in my model, and only in the parallel case. I have roughly 13 million rows of data, so unfortunately I can't share it here for replication purposes (>1 GB in size). However, with this many data points it seems unlikely that a bootstrap replicate would produce a rank-deficient design matrix by chance - let alone that many replicates would.
When I compare the beta values generated by parallel and serial fitlm execution, they differ only at the level of machine precision (mean absolute difference: 8.8e-16). So despite issuing the rank-deficiency warning in parallel, fitlm seems to arrive at essentially the same beta estimates whether it runs in parallel or serially.
Any feedback on this would be greatly appreciated! I have not previously seen a case of parallel-specific warnings when using the parallel toolbox. I am speculating that the data table might be modified (compressed?) when it is passed to the workers, leading to problems with the design matrix. But this is just speculation. I didn't see anything in the documentation to reflect this.
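One way to test that speculation directly would be something like the sketch below. It reuses the variable names from my pseudo-code further down; the replicate index i = 1 is just a placeholder for one that actually triggered the warning.

```matlab
% Hypothetical diagnostic: build the same bootstrap replicate on the client
% and on each worker, then compare them. If the table were modified
% (compressed?) in transfer, isequaln would return false on some worker.
i = 1;                                           % placeholder replicate index
bootLocal = bigDataTable(bootsamp(:,i),:);       % built on the client
spmd
    bootWorker = bigDataTable(bootsamp(:,i),:);  % built on a worker (broadcast copy)
    sameTable = isequaln(bootWorker, bootLocal); % compare the raw data
end
disp([sameTable{:}])  % all true => the table survives the transfer intact
```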
Thanks!
Pseudo-code below:
rng(1); % Set random seed for reproducibility (rng(1) alone fixes both seed and generator)
[~,bootsamp]=bootstrp(10000,[],1:height(bigDataTable)); % Generate 10k bootstrap replicates
betas_par = nan(10000,5); % Variables to hold beta estimates
betas_ser = nan(10000,5);
% This code leads to rank-deficient design matrix warnings:
parfor i = 1:10000
bootDataTable = bigDataTable(bootsamp(:,i),:); % Create bootstrapped data on worker
mdl = fitlm(bootDataTable, 'y~1+x1+x2+x3+x4'); % Fit betas on worker (x1-x4 are continuous or categorical vars)
betas_par(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
% This almost-identical code does not produce warnings despite running the same models with the same data:
for i = 1:10000
bootDataTable = bigDataTable(bootsamp(:,i),:); % Create bootstrapped data
mdl = fitlm(bootDataTable, 'y~1+x1+x2+x3+x4'); % Fit betas (x1-x4 are continuous or categorical vars)
betas_ser(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
15 Comments
Walter Roberson
on 1 Sep 2022
As an experiment, try fitting the same data on multiple workers. If you can track down which i value is giving the warning in your existing code, try that value on multiple workers.
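That experiment could be sketched like this (my own construction; iBad is a placeholder for a replicate index that warned, and parfor does not guarantee exactly one iteration per worker, but with a loop length equal to the pool size most workers will get one):

```matlab
% Fit the *same* bootstrap replicate once per worker and record whether the
% rank-deficiency warning fires, using lastwarn to capture the warning state.
iBad = 123;                                      % placeholder: a replicate that warned
bootFixed = bigDataTable(bootsamp(:,iBad),:);    % one fixed bootstrapped table
nWorkers = gcp().NumWorkers;
warned = false(nWorkers,1);
parfor w = 1:nWorkers
    lastwarn('');                                % clear the warning state
    mdl = fitlm(bootFixed, 'y~1+x1+x2+x3+x4');   %#ok<NASGU>
    warned(w) = contains(lastwarn, 'rank deficient');
end
disp(warned')  % does identical data warn on some workers but not others?
```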
Answers (2)
Bruno Luong
on 1 Sep 2022
Can you please try this code:
[~,bootsamp] = bootstrp(10000,[],1:height(bigDataTable)); % Generate 10k bootstrap replicates
betas_par = nan(10000,5); % Variable to hold beta estimates
% Build every bootstrapped table on the client first, so the parfor only
% sends each worker its own slice (tables cannot be reshaped or permuted to 3-D):
bootTables = cell(10000,1);
for i = 1:10000
    bootTables{i} = bigDataTable(bootsamp(:,i),:);
end
% Does this version still lead to rank-deficient design matrix warnings?
parfor i = 1:10000
    mdl = fitlm(bootTables{i}, 'y~1+x1+x2+x3+x4'); % Fit betas on worker (x1-x4 are continuous or categorical vars)
    betas_par(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
3 Comments
Bruno Luong
on 1 Sep 2022
Edited: Bruno Luong
on 1 Sep 2022
I believe when you do
bootDataTable = bigDataTable(bootsamp(:,i),:)
inside parfor, there must be some complex data exchange between the client and the workers (I'm even surprised such syntax is accepted by MATLAB), or some concurrent access to a shared variable is going on. It is not clear to me how this is handled internally.
I just suspect the problem could come from there.
Maybe you could try a toy example rather than your full bootstrap to confirm this is the issue.
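A toy example along those lines might look like this (entirely synthetic data of my own, not the asker's; it uses the same indexing pattern with one continuous and one categorical predictor to see whether parfor alone triggers the warning):

```matlab
% Small synthetic table: continuous x1, 3-level categorical g, response y.
rng(1);
n = 1000;
toyTable = table(randn(n,1), categorical(randi(3,n,1)), ...
    'VariableNames', {'x1','g'});
toyTable.y = toyTable.x1 + randn(n,1);
[~,samp] = bootstrp(100, [], 1:n);               % 100 toy bootstrap replicates
warned = false(100,1);
parfor i = 1:100
    lastwarn('');                                % clear the warning state
    bootT = toyTable(samp(:,i),:);               %#ok<PFBNS> toyTable is broadcast
    mdl = fitlm(bootT, 'y ~ 1 + x1 + g');        %#ok<NASGU>
    warned(i) = contains(lastwarn, 'rank deficient');
end
fprintf('%d of 100 toy replicates warned\n', sum(warned));
```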
Edric Ellis
on 1 Sep 2022
There is no worker-to-worker communication at all in parfor. In the statement:
bootDataTable = bigDataTable(bootsamp(:,i),:)
The variable bootsamp is a parfor sliced variable: only the relevant pieces of bootsamp are sent to each worker. The variable bigDataTable is a broadcast variable: the whole value is copied to each worker.
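The distinction can be seen with a minimal example (variables of my own, with the transfer measured via ticBytes/tocBytes):

```matlab
% A is broadcast: the whole matrix is copied to every worker.
% B is sliced: each worker receives only the columns its iterations need.
A = rand(2000);
B = rand(2000, 100);
out = zeros(100,1);
ticBytes(gcp);
parfor i = 1:100
    out(i) = sum(A(:,1)) + sum(B(:,i));  % indexing A with a constant keeps it broadcast
end
tocBytes(gcp)  % bytes sent are dominated by full copies of A, not slices of B
```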
Bruno Luong
on 1 Sep 2022
Could you try this:
C_bigDataTable = parallel.pool.Constant(bigDataTable);
% This code leads to rank-deficient design matrix warnings:
parfor i = 1:size(bootsamp,2)
bootDataTable = C_bigDataTable.Value(bootsamp(:,i),:); % Create bootstrapped data on worker
mdl = fitlm(bootDataTable, 'y~1+x1+x2+x3+x4'); % Fit betas on worker (x1-x4 are continuous or categorical vars)
betas_par(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
2 Comments
Bruno Luong
on 1 Sep 2022
I've run out of ideas.
If you can reproduce this with a small example, maybe you should report it officially as a potential bug.