How do I troubleshoot the "lost connection to worker X" parallel error?

How do I troubleshoot when I encounter the following error?
ERROR: The client lost connection to worker #. This might be due to network problems, or the interactive communicating job might have errored.
This error is often preceded by a warning:
A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
before the full error is reported:
ERROR: All workers aborted during execution of the parfor loop.
Error in mycode (line 19) parfor j = 1:n
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might have errored.

Accepted Answer

MathWorks Support Team on 2 Nov 2023
Edited: MathWorks Support Team on 2 Nov 2023
There are two main causes of this error. The easier one to rule out is a crash of the worker in question.
MATLAB Worker Crash
A crashed worker leaves behind a crash dump, just like a regular MATLAB session. On a cluster, this crash dump is on the compute node hosting that worker.
For clusters that do not use MATLAB Job Scheduler, crash dumps can also be found in the job storage location:
>> c=parcluster()
>> c.JobStorageLocation
In that location, find the Job# folder for the job that failed and open any Job#.log files.
Once you have the crash dump, examine it for further information. If the crash was in MEX code that you wrote, it is worth running that MEX code locally to check for issues. Otherwise, please contact Technical Support for help understanding and troubleshooting the crash.
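The lookup above can be sketched in a few lines. This is only an illustration: it assumes the default cluster profile is the one that ran the failed job, that JobStorageLocation is a single folder path, and that the log files end in .log somewhere beneath it. The variable names are made up for the example.

```matlab
% Sketch: list log files under the job storage folder of the default profile.
c = parcluster();                   % default cluster profile (assumption)
storageDir = c.JobStorageLocation;

% Assumption: JobStorageLocation is a single path (char or string).
% Some cluster types may return something else, which this sketch skips.
if ischar(storageDir) || isstring(storageDir)
    % '**' searches subfolders recursively (R2016b or later)
    logFiles = dir(fullfile(char(storageDir), '**', '*.log'));
    for k = 1:numel(logFiles)
        fprintf('%s  (%s)\n', ...
            fullfile(logFiles(k).folder, logFiles(k).name), logFiles(k).date);
    end
else
    logFiles = [];
    disp('JobStorageLocation is not a single path; inspect it manually.');
end
```

The printed modification dates can help pick out the log that belongs to the failed job.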
Network/Memory/Communication Issues
If no crash dumps can be found, then network, memory, or communication issues are the likely cause. On Linux, the OS might have terminated a worker for using too much memory. Alternatively, if any of the machines is significantly slowed down by resource contention (e.g. memory swapping), communication between workers can be delayed enough to disrupt the pool.
In addition to node slowdown, network latency or dropped connections can also cause the failure. At this point, try checking the reliability of the network or consider the step below.
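One way to check the memory cause described above on Linux is to search the kernel log for out-of-memory kills. The following is a rough sketch, not an official procedure: it has to run on the machine that hosted the lost worker, it assumes the dmesg command is available and readable by your user, and the search pattern is only a guess at typical kernel wording.

```matlab
% Sketch: search the Linux kernel log for out-of-memory kills.
status = NaN;   % stays NaN on non-Linux platforms
output = '';
if isunix && ~ismac
    % Assumption: dmesg is on the path; stderr is discarded so that a
    % permission error simply yields no matches.
    cmd = 'dmesg 2>/dev/null | grep -i -E "out of memory|killed process"';
    [status, output] = system(cmd);
    if status == 0
        disp(output);   % matching kernel log lines
    else
        disp('No matching entries found, or the kernel log is not readable by this user.');
    end
end
```

A matching line that names a MATLAB process around the time of the failure would suggest the worker was killed for memory use rather than lost to a network fault.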
Setting SpmdEnabled to false
A pool with SpmdEnabled set to true cannot continue once communication between workers, or between the workers and the client, has been lost. If you are using the local scheduler or MATLAB Job Scheduler and only use parfor and parfeval, you can instead create the pool with 'SpmdEnabled' set to false. See the documentation about SpmdEnabled for details: https://www.mathworks.com/help/parallel-computing/parpool.html
A pool with SpmdEnabled set to false cannot run spmd statements.
With this option, the remaining workers continue to complete the parallel work even after one worker has lost its connection.
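As a minimal sketch of this setting, the pool can be created with the 'SpmdEnabled' name-value argument before the parfor loop runs. The loop body and the variable names here are placeholders, and the default cluster profile is assumed.

```matlab
% Sketch: run a parfor loop on a pool created with SpmdEnabled set to false.
delete(gcp('nocreate'));                 % close any existing pool first
pool = parpool('SpmdEnabled', false);    % default profile (assumption)

n = 100;
results = zeros(1, n);
parfor j = 1:n
    results(j) = j^2;                    % placeholder workload
end

delete(pool);                            % shut the pool down when finished
```

Because spmd is unavailable on such a pool, this only suits code that relies on parfor and parfeval.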
If you need further help with this error, please contact Technical Support.
