Parfor: worker aborted during execution of the parfor loop
When I run my parfor loop on a remote cluster (16 c5.xlarge worker machines with 2 cores each, plus a dedicated m5.xlarge head node with 2 cores), I get the following error:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
> In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 395)
In parallel_function>distributed_execution (line 746)
In parallel_function (line 578)
In FUN_CLUSTER_FORECASTING (line 54)
In parallel.internal.cluster.executeFunction (line 29)
In parallel.internal.evaluator.evaluateWithNoErrors (line 14)
In parallel.internal.evaluator/MJSStreamingEvaluator/evaluate (line 40)
In dctEvaluateTask>iEvaluateTask/nEvaluateTask (line 354)
In dctEvaluateTask>iEvaluateTask (line 175)
In dctEvaluateTask (line 81)
In distcomp_evaluate_task>iDoTask (line 152)
In distcomp_evaluate_task (line 74)
In distcomp_evaluate_task_mvm (line 39)
Sending a stop signal to all the labs...
Y is of size (226 × 440) when I get the error. The parfor loop runs without any problems for smaller specifications, for example when Y is of size (226 × 120).
A simplified version of the parfor loop:
% initialize the output variable
forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
% irep is an array, e.g. irep = [113,114,..,214]
irep = T_thres:T-h;
parfor (ij = 1:length(irep))
fun = BCTRVAR(Y(1:irep(ij),:),h,series_to_eval);
% h is the forecast horizon, e.g. h = 12, so ii = 1,2,..,12
for ii = 1:h
forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
end
end
I am not sure whether it is related, but I also get the following warning on the variable Y:
'The entire array or structure Y is a broadcast variable. This might result in unnecessary communication overhead.'
Could the overhead be the cause of the issue?
4 Comments
Merlin Scherer
on 4 Jan 2022
Edric Ellis
on 5 Jan 2022
This is probably not related to the "broadcast variable" message (at least, not directly). The initial error you received indicates that the worker MATLAB process crashed. Can you reproduce the problem using the 'local' cluster? It may be a little easier to diagnose the problems there. Do you know if the parfor loop needs to transmit a large amount of data? You can use ticBytes/tocBytes to work this out.
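A minimal sketch of the ticBytes/tocBytes measurement suggested above, run on a local pool. The sizes of Y, T_thres and h are placeholder values chosen to match the question, and the loop body is only a stand-in for BCTRVAR (which is not shown in this thread), so the numbers it reports are illustrative rather than a measurement of the real workload.

```matlab
% Sketch: measure how much data a parfor loop transfers, on a local pool.
% Assumptions: Y, T_thres and h are placeholder values, and mean() is a
% stand-in for the real BCTRVAR call.
pool = gcp('nocreate');
if isempty(pool)
    % In newer releases the local profile may be named 'Processes'.
    pool = parpool('local');
end

Y = randn(226, 440);               % same size as the failing case
T_thres = 113;
h = 12;
irep = T_thres:(size(Y, 1) - h);   % 113, 114, ..., 214
colMeans = zeros(numel(irep), size(Y, 2));

ticBytes(pool);                    % start counting transferred bytes
parfor ij = 1:numel(irep)
    Yi = Y(1:irep(ij), :);         % Y is broadcast in full to every worker
    colMeans(ij, :) = mean(Yi, 1); % stand-in computation
end
bytes = tocBytes(pool);            % one row per worker: [sent to, received from]
disp(bytes)
```

If the per-worker byte counts are small relative to the memory available on each worker, data transfer is unlikely to be what crashes the worker process, and the memory used inside the loop body is the next thing to check.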