fitlm warns of singular design matrix when run in parfor loop, but not in a serial for loop (same data)

5 views (last 30 days)
I wished to estimate the distributions of some linear regression beta values using a bootstrap resampling. To save time, I ran the fitlm beta estimation of bootstrap samples in a parfor loop. To do so, I did the following:
  1. Set random seed for reproducibility.
  2. Generated bootstrap replicate indices (indices of original data sampled with replacement)
  3. Passed original data table and a bootstrap replicate's indices to a worker using parfor.
  4. Constructed the bootstrapped data on the worker using the original data table and bootstrap indices
  5. Ran fitlm on the worker and saved the beta values from the linear regression
When I run this code, I get the following warning: "Warning: Regression design matrix is rank deficient to within machine precision." This warning appears for many (but not all) of my bootstrap replicates.
However, when I run the above process serially, I do not get the rack-deficiency warning. The bootstrapped data should be the same between serial and parallel cases as I am setting the random seed before generating the boostrap replicates; I simply switch the parfor to for. I am not suppressing warnings in either case.
For what it is worth, my data table contains both continuous and categorical variables. When I limit my model to continuous variables, I do not get the rank-deficiency warning when I run the code in parallel. This warning is only generated (and only in the parallel case) when I include categorical variables in my model. I have roughly 13 million rows in my data, so unfortunately I can't share it here for error replication purposes (>1GB in size). However, with this many data points it seems unlikely a bootstrap replicate would generate a rank deficient design matrix by chance - let alone many rank deficient matrices.
When I compare the beta values generated by parallel and serial fitlm execution, they do differ but only at the level of machine precision (mean absolute difference: 8.8e-16). So despite throwing the rank deficiency error in parallel, fitlm seems to be arriving at very similar beta estimates regardless of whether it's run in parallel or serially.
Any feedback on this would be greatly appreciated! I have not previously seen a case of parallel-specific warnings when using the parallel toolbox. I am speculating that the data table might be modified (compressed?) when it is passed to the workers, leading to problems with the design matrix. But this is just speculation. I didn't see anything in the documentation to reflect this.
Thanks!
Pseudo-code below:
rng('default'); % Set random seed
rng(1);
[~,bootsamp]=bootstrp(10000,[],1:height(bigDataTable)); % Generate 10k bootstrap replicates
betas_par = nan(10000,5); % Variables to hold beta estimates
betas_ser = nan(10000,5);
% This code leads to rank-deficient design matrix warnings:
parfor i = 1:10000
bootDataTable = bigDataTable(bootsamp(:,i),:); % Create boostrapped data on worker
mdl = fitlm(bootDataTable, 'y~1+x1+x2+x3+x4'); % Fit betas on worker (x1-x4 are continuous or categorical vars)
betas_par(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
% This almost-identical code does not produce warnings despite running the same models with the same data:
for i = 1:10000
bootDataTable = bigDataTable(bootsamp(:,i),:); % Create boostrapped data
mdl = fitlm(bootDataTable, 'y~1+x1+x2+x3+x4'); % Fit betas (x1-x4 are continuous or categorical vars)
betas_ser(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
  15 Comments
Walter Roberson
Walter Roberson on 1 Sep 2022
as an experiment, try asking to fit the same data on multiple workers. If you can track down which i value is giving the warning in your existing code, try that value on multiple workers
James Hengenius
James Hengenius on 2 Sep 2022
Edited: James Hengenius on 2 Sep 2022
So I think I found the reason why this warning is occurring. But I don't understand why the warning is only occurring on workers in my original code.
One of the categorical variables I am including is race. One of the possible values for this variable is Other. (Only one person in my sample has the value of Other.) So when I do bootstrap resampling, it is possible that a resulting bootstrap replicate will contain no rows with race == Other.
As a test, I excluded all rows with race == Other
tbl; % Is my full dataset
tbl_exclude = tbl; tbl_exclude(tbl_exclude.race=='Other')=[]; % Remove all rows with race==Other
Then I checked that the rows were in fact removed.
find(tbl_exclude.race=='Other')
ans =
0×1 empty double column vector
Finally, I ran a test model in fitlm.
fitlm(tbl_exclude, 'y~1+race')
Warning: Regression design matrix is rank deficient to within machine precision.
> In classreg.regr/CompactTermsRegression/checkDesignRank (line 35)
In LinearModel.fit (line 1040)
In fitlm (line 121)
ans =
Linear regression model:
y ~ 1 + race
Estimated Coefficients:
Estimate SE tStat pValue
__________ ______ ______ ______
(Intercept) 6.521e+06 6781.8 961.55 0
race_Black/African American 4.0891e+05 6959.8 58.753 0
race_Other 0 0 NaN NaN
race_White 3.7114e+05 6947.6 53.419 0
Number of observations: 13801200, Error degrees of freedom: 13801197
Root Mean Squared Error: 3.98e+06
R-squared: 0.000255, Adjusted R-Squared: 0.000255
F-statistic vs. constant model: 1.76e+03, p-value = 0
As you can see, it prduces the warning I was seeing. And even though the table I pass to fitlm doesn't contain any race==Other data, fitlm still tries to fit the beta for the variable race_Other.
I am guessing that the table tbl_exclude contains metadata from the original table tbl. This metadata might record that race==Other is a possibility even though it is no longer present in tbl_exclude. I looked in tbl_exclude.Properties for anything like this but didn't see it.
I am also unsure why my original code suppressed this warning when I ran the bootstrap serially.
EDIT TO ADD:
I found the function removecats (https://www.mathworks.com/help/matlab/ref/categorical.removecats.html) which removes unused categories from categorical arrays. After running
tbl_exclude.race = removecats(tbl_exclude.race)
I no longer get the rank deficient design matrix warning from fitlm.
This doesn't explain why the warning was suppressed in serial but not parallel execution, but it does identify the source of the warning. My bootstrap resample had a "ghost" category that was not represented in the data but was part of the categorical variables metadata. removecats cleaned this up.

Sign in to comment.

Answers (2)

Bruno Luong
Bruno Luong on 1 Sep 2022
Can you please try this code
[~,bootsamp]=bootstrp(10000,[],1:height(bigDataTable)); % Generate 10k bootstrap replicates
betas_par = nan(10000,5); % Variables to hold beta estimates
EnvenBiggerDataTab = bigDataTable(bootsamp,:);
EnvenBiggerDataTab = reshape(EnvenBiggerDataTab, [size(bootsamp), size(bigDataTable,2)]);
EnvenBiggerDataTab = permure(EnvenBiggerDataTab,[1 3 2]);
% This code leads to rank-deficient design matrix warnings:
parfor i = 1:10000
bootDataTable = EnvenBiggerDataTab(:,:,i); % Create boostrapped data on worker
mdl = fitlm(bootDataTable, 'y~1+x1+x2+x3+x4'); % Fit betas on worker (x1-x4 are continuous or categorical vars)
betas_par(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
  3 Comments
Bruno Luong
Bruno Luong on 1 Sep 2022
Edited: Bruno Luong on 1 Sep 2022
I believe when you do
bootDataTable = bigDataTable(bootsamp(:,i),:)
inside parfor, there must be some complex exchanges between workers (I'm even surprise such syntax is accepted by MATLAB), or some concurrent access to a shared variable is going on. It is not clear to me how it handles internally.
I just suspect the problem could come from there.
May be you could try with a toy example rather than your final bootstrap to confirm this is an issue.
Edric Ellis
Edric Ellis on 1 Sep 2022
There is no worker-to-worker communication at all in parfor. In the statement:
bootDataTable = bigDataTable(bootsamp(:,i),:)
The variable bootsamp is a parfor sliced variable - only the relevant pieces of bootsamp are sent to each worker. The variable bigDataTable is a broadcast variable. The whole value is copied to each worker.

Sign in to comment.


Bruno Luong
Bruno Luong on 1 Sep 2022
Could you try this:
C_bigDataTable = parallel.pool.Constant(bigDataTable);
% This code leads to rank-deficient design matrix warnings:
parfor i = 1:size(bootsamp,2)
bootDataTable = C_bigDataTable.Value(bootsamp(:,i),:); % Create boostrapped data on worker
mdl = fitlm(bootDataTable, 'y~1+x1+x2+x3+x4'); % Fit betas on worker (x1-x4 are continuous or categorical vars)
betas_par(i,:) = mdl.Coefficients.Estimate'; % Save beta coefficients
end
  2 Comments
James Hengenius
James Hengenius on 1 Sep 2022
Hi Bruno, I gave this a try and made my bigDataTable a constant on the workers. Unfortunately, this had no effect. The parallel execution of fitlm still produces a rank deficient design matrix warning even with this setup.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!