Can a MATLAB Distributed Computing Engine worker resume execution of a task after a crash or restart?

If I stop a worker that is currently executing a task, that task fails and the job fails with the following error message:
 
Warning: Errors occurred during execution of task 2. Results may be incorrect.
The worker working on this task re-registered with the job manager as an idle worker.
This indicates that the worker crashed during execution or had problems submitting its results.
I would like to configure my job to be robust against worker failures.

Accepted Answer

MathWorks Support Team on 31 Jan 2017
Automatic rerunning of failed tasks was added in Parallel Computing Toolbox 4.0 (R2008b). From this version onward, when you use a job manager, a task that does not complete because of certain system failures can be rerun up to a specified number of times. Three new task properties support this: 'MaximumNumberOfRetries', 'AttemptedNumberOfRetries', and 'FailedAttemptInformation'.
For more information, see the Parallel Computing Toolbox 4.0 (R2008b) Release Notes. You can open the same page locally by typing the following at the MATLAB prompt:
 
web([docroot,'/toolbox/distcomp/rn/brorfjs-1.html#brsb5p3-1'])
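As a rough illustration of the retry properties named above, the sketch below sets a retry limit on a single task and then inspects the retry information once the job has finished. It assumes the R2008b-era job manager interface (findResource, createJob, createTask, waitForState, getAllOutputArguments). The job manager name 'MyJobManager', the lookup host 'myhost', and the task function are placeholders, not values from this answer. Newer releases use a different interface, so check the documentation for your version.

```matlab
% Sketch only: 'MyJobManager' and 'myhost' are placeholder names.
jm = findResource('scheduler', 'type', 'jobmanager', ...
                  'Name', 'MyJobManager', 'LookupURL', 'myhost');

job  = createJob(jm);
task = createTask(job, @sum, 1, {[1 2 3]});   % placeholder task function

% Allow the job manager to rerun this task up to 3 times
% if it does not complete because of a system failure.
set(task, 'MaximumNumberOfRetries', 3);

submit(job);
waitForState(job, 'finished');

% Inspect how many reruns were attempted and what was recorded for them.
disp(get(task, 'AttemptedNumberOfRetries'));
disp(get(task, 'FailedAttemptInformation'));

results = getAllOutputArguments(job);
```

The retry limit is set per task, so a job with many tasks would need the property set on each task object before submit is called.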
For earlier product releases, the following limitation and workarounds apply:
The ability to resume execution of the task or to automatically re-queue a failed task after a worker crashes or is stopped is not available in the MATLAB Distributed Computing Engine.
If you need to restart a worker, it is best to do so while the worker is idle. To recover from a failed job, you can write code that checks whether a task failed with this particular error message and, if so, submits a new task for the same work.
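One possible shape for such recovery code is sketched below. The function name resubmitCrashedTasks is hypothetical, and the sketch assumes that the text "re-registered with the job manager" from the warning also appears in the task's ErrorMessage property, which you should verify on your own installation. Because tasks cannot be added to a job that has already run, the sketch collects the affected tasks into a new job.

```matlab
function newJob = resubmitCrashedTasks(jm, job)
% RESUBMITCRASHEDTASKS  Hypothetical helper: resubmit tasks lost to a worker crash.
%   jm  - job manager object that ran the original job
%   job - finished job whose tasks should be checked
% Returns the newly submitted job, or [] if no task needed resubmitting.

% Assumption: this fragment of the warning text shows up in ErrorMessage.
crashText = 're-registered with the job manager';

newJob = [];
tasks  = get(job, 'Tasks');
for k = 1:numel(tasks)
    t   = tasks(k);
    msg = get(t, 'ErrorMessage');
    if ~isempty(strfind(msg, crashText))
        if isempty(newJob)
            newJob = createJob(jm);   % create the recovery job lazily
        end
        % Recreate the task with the same function, outputs, and inputs.
        createTask(newJob, get(t, 'Function'), ...
                   get(t, 'NumberOfOutputArguments'), ...
                   get(t, 'InputArguments'));
    end
end

if ~isempty(newJob)
    submit(newJob);
end
end
```

After the original job finishes, you could call newJob = resubmitCrashedTasks(jm, job) and, if newJob is not empty, wait for it and merge its outputs with those of the original job. Any file or path dependencies set on the original job are not copied by this sketch and would need to be set on the new job as well.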
The documentation for the MATLAB Distributed Computing Engine 2.0.1 (R2006a) and previous versions does not clearly spell out this limitation of the checkpoint directories. The checkpoint directories allow engine services to automatically resume their sessions after a system goes down and comes back up, which minimizes the loss of data. However, if a MATLAB worker goes down during the evaluation of a task, that task is neither reevaluated nor reassigned to another worker. In this case, a finished job may not have a complete set of output data, as data from any unfinished tasks might be missing.
