Parfor: worker aborted during execution of the parfor loop

When running my parfor loop on a remote cluster (16 c5.xlarge worker machines with 2 cores each and a dedicated m5.xlarge head node with 2 cores), I get the following error:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
> In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 395)
In parallel_function>distributed_execution (line 746)
In parallel_function (line 578)
In FUN_CLUSTER_FORECASTING (line 54)
In parallel.internal.cluster.executeFunction (line 29)
In parallel.internal.evaluator.evaluateWithNoErrors (line 14)
In parallel.internal.evaluator/MJSStreamingEvaluator/evaluate (line 40)
In dctEvaluateTask>iEvaluateTask/nEvaluateTask (line 354)
In dctEvaluateTask>iEvaluateTask (line 175)
In dctEvaluateTask (line 81)
In distcomp_evaluate_task>iDoTask (line 152)
In distcomp_evaluate_task (line 74)
In distcomp_evaluate_task_mvm (line 39)
Sending a stop signal to all the labs...
The error occurs when Y is of size 226 × 440. With smaller specifications the parfor loop runs without any problems, for example when Y is of size 226 × 120.
A simplified version of the parfor loop:
% initialize the output variable
forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
% irep is an array, e.g. irep = [113,114,..,214]
irep = T_thres:T-h;
parfor (ij = 1:length(irep))
    fun = BCTRVAR(Y(1:irep(ij),:),h,series_to_eval);
    % h is the forecast horizon, e.g. h = 12 (horizons 1,2,..,12)
    for ii = 1:h
        forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
    end
end
I am not sure whether it is related, but I also get the following warning on the variable Y:
'The entire array or structure Y is a broadcast variable. This might result in unnecessary communication overhead.'
Could the overhead be the cause of the issue?

4 Comments

I sliced my data to avoid potential problems from overhead produced by broadcast variables. Unfortunately, this does not solve my issue and the error persists.
% irep must be defined before the slices are built
irep = T_thres:T-h;
% initialize slice cell arrays
Y_slice = cell(1,length(irep));
yi_slice = cell(length(irep),h);
% pre-slice Y to avoid overhead problems
for ik = 1:length(irep)
    Y_slice{ik} = Y(1:irep(ik),:);
    for ii = 1:h
        % yi_slice is not used in the simplified loop below
        yi_slice{ik,ii} = Y(ik+ii,series_to_eval);
    end
end
forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
parfor (ij = 1:length(irep))
    fun = BCTRVAR(Y_slice{ij},h,series_to_eval);
    for ii = 1:h
        forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
    end
end
This is probably not related to the "broadcast variable" message (at least, not directly). The initial error you received indicates that the worker MATLAB process crashed. Can you reproduce the problem using the 'local' cluster? It may be a little easier to diagnose the problems there. Do you know if the parfor loop needs to transmit a large amount of data? You can use ticBytes/tocBytes to work this out.
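A minimal sketch of the ticBytes/tocBytes measurement suggested here, assuming a parallel pool can be opened. The sizes mirror the question, but Y is random data, series_to_eval is an arbitrary choice, and standInModel is a hypothetical placeholder for BCTRVAR, which is not shown in this thread.

```matlab
% Sketch: measure how much data a parfor loop sends to and from the workers.
% Assumptions: random Y with the size from the question; standInModel is a
% hypothetical placeholder for BCTRVAR; series_to_eval is chosen arbitrarily.
T = 226; N = 440; h = 12; T_thres = 113;
Y = randn(T, N);
series_to_eval = 1:5;
irep = T_thres:T-h;
nRep = numel(irep);
nSeries = numel(series_to_eval);

% Returns an h-by-nSeries matrix (the column means repeated for each horizon).
standInModel = @(Ywin, h, idx) repmat(mean(Ywin(:, idx), 1), h, 1);

forecast = zeros(nRep, h, nSeries);

pool = gcp();            % with default preferences, opens a pool if none exists
ticBytes(pool);          % start counting bytes on every worker
parfor ij = 1:nRep
    f = standInModel(Y(1:irep(ij), :), h, series_to_eval);
    forecast(ij, :, :) = reshape(f, 1, h, nSeries);
end
bytes = tocBytes(pool);  % one row per worker: [sent to worker, received from worker]
disp(bytes)
```

Running the same measurement with the real function and with the pool opened on the remote cluster profile would show whether the transfer volume differs there.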
The problem does not reproduce when I run the batch on a 'local' cluster.
The parfor loop runs without any problems on the cloud cluster when I change the function inside the parfor loop.
To see how much data was transmitted to the workers, I tested the parfor loop on the 'local' cluster with another function instead of BCTRVAR(.):
         BytesSentToWorkers    BytesReceivedFromWorkers
1        2.0043e+07            1.93e+05
2        2.0004e+07            1.8097e+05
3        1.9247e+07            1.6893e+05
Total    5.9294e+07            5.429e+05
Is this a large amount of data being transferred?
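For scale, the reported total of 5.9294e+07 bytes is roughly 59 MB. A back-of-the-envelope sketch of the data sizes involved, assuming double precision (8 bytes per element) and the example values from the question (T = 226, 440 columns, irep = 113:214):

```matlab
% Rough data sizes under the stated assumptions (not measured values).
bytesPerDouble = 8;
nCols = 440;
irep = 113:214;          % example range from the question

% Full matrix Y: 226 * 440 * 8 = 795520 bytes (about 0.8 MB)
bytesY = 226 * nCols * bytesPerDouble;

% All pre-sliced windows Y_slice{ij} together:
% sum(113:214) = 16677 rows, times 440 * 8 = 58703040 bytes (about 58.7 MB)
bytesSlices = sum(irep) * nCols * bytesPerDouble;

fprintf('Full Y:          %.2f MB\n', bytesY / 1e6);
fprintf('All Y_slice{ij}: %.2f MB\n', bytesSlices / 1e6);
```

If these assumptions hold, the measured total is close to the combined size of the pre-sliced windows, and both figures are small next to several GB of memory per worker. That would point toward memory used inside the function rather than the transfer itself, although the numbers above were measured with a different function on the 'local' cluster.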
@Edric Ellis, are there ways to get a more detailed error message?
As you suggested in "Problem with parfor loop", I also ran the job on the remote cluster while requesting only one worker. This did not reproduce the problem. Do you have any thoughts on what this could mean?

 Accepted Answer

I solved the problem by changing the worker machine type from c5.xlarge (4 GB/core) to m5.xlarge (8 GB/core), so I think the workers did not have enough memory.
This answer to the question
"Please suggest me to write the parallel loop in MATLAB without the workers getting aborted during the course of execution?"
by Raymond Norris helped me figure this out.
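One way to check the memory hypothesis before changing machine types is to ask each worker how much memory its host reports. A minimal sketch, assuming the workers run Linux (so the free command is available) and that a pool is open on the cluster in question; labindex is the worker index available in R2021b.

```matlab
% Sketch: print the memory reported by each worker's host machine.
% Assumption: the workers run Linux, so the 'free' command exists.
pool = gcp();
spmd
    [status, out] = system('free -m');   % values in MiB, as seen by this worker's host
    if status == 0
        fprintf('Worker %d:\n%s\n', labindex, out);
    else
        fprintf('Worker %d: could not run free (status %d)\n', labindex, status);
    end
end
```

Comparing the available memory with the expected footprint of one loop iteration, multiplied by the number of workers sharing a machine, gives a rough idea of the headroom.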

3 Comments

I've started getting the same error for code that used to work error-free until recently.
How exactly do you specify the types of workers in MATLAB, if I may ask? The link to the answer above is broken.
Cheers!
I increased the memory from 10 GB per core (slot) to 20 GB per core (slot), and the problem was solved.
It would be awesome if you could write out the steps you took to solve the issue. The link is broken, and I went to the thread, but it does not help at all.
Please please, programmers, when you find a solution, just write it out!

Release

R2021b
