Discarded Messages with SPMD and labReceive ... why?
EvanThomas on 18 Jul 2022
Edited: EvanThomas on 20 Jul 2022
I am using SPMD and trying to get some workers communicating with each other. There is a flag they need to send and receive. Whichever worker gets its job done and committed first sends out the flag, which the remaining workers should receive and therefore not commit their work.
Here is some abstract code that hopefully gets across what I am trying to do. I would have thought the labBarrier at the bottom would ensure that every worker finishing in second place or later had received the flag from the first worker to finish. Some do, but I also get many warning messages similar to the following:
Warning: An incoming message was discarded from lab 2 (tag: 2)
Indeed, some workers are missing the message, even if they finish seconds after the flag was sent out.
How does labSend work? Am I missing something here?
% Emulating workers doing some variable time task
% See if other workers got their first and sent an update
[Updates(i),srcWkrIdx,tag] = labReceive(i,2);
Updates(i) = 0;
% Commit work
flag = 1
% Otherwise take a nap
flag = 0
labSend(flag,agentVec(agentVec ~= labindex),2);
I would actually say exactly the opposite - MPI is (generally) very reliable and predictable. I shall post an answer with a suggestion as to how you might proceed.
In the code that you've written, each worker is guaranteed to labSend to each other worker. However, each worker is not guaranteed to labReceive from each other worker. There are guaranteed to be mismatched send/receives.
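To make the mismatch concrete, here is a minimal sketch of what a fully matched exchange looks like, where every labSend has exactly one corresponding labReceive. It is not the code from the question: the payload (each worker's own labindex) and the variable names `received` and `others` are placeholders chosen for illustration. Workers take turns as the sender, so the pattern should not depend on whether labSend returns before the matching labReceive is called.

```matlab
spmd
    tag = 2;                          % same tag value as in the question
    flag = labindex;                  % placeholder payload (assumption)
    received = zeros(1, numlabs);     % one slot per worker
    received(labindex) = flag;        % a worker already knows its own value

    for sender = 1:numlabs
        if labindex == sender
            % This worker's turn: send to every other worker.
            others = setdiff(1:numlabs, labindex);
            if ~isempty(others)
                labSend(flag, others, tag);
            end
        else
            % Everyone else posts exactly one matching receive.
            received(sender) = labReceive(sender, tag);
        end
    end
end
```

Because each send is paired with a receive, no messages should be left pending when the spmd block ends, which is the situation the "incoming message was discarded" warning points to. This only illustrates the pairing requirement; a collective function does the same job in one call.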
Using conditional receives in this way is not a robust way to get the workers to collaborate - you have an ordering problem that cannot be solved. I think you can probably achieve your goal by using one of the "reduction" functions which are designed to collect together results from multiple workers. In particular, you could try gcat to allow each worker to find out what happened on every other worker. gcat (effectively) collects values from all workers and concatenates them together on each worker. In this way, you don't need the labBarrier call either. Something a bit like this:
myResult = doSomeWork();
allResults = gcat(myResult);
% Now, choose what to do based on the results from all workers.
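A slightly fuller sketch of the same idea, under some assumptions: the variable-time task is emulated with `pause`, "finishing first" is approximated by each worker's locally measured elapsed time, and the names `myTime`, `allTimes`, `winner` and `committed` are hypothetical. Every worker ends up with the same vector, so they all apply the same rule and agree on who commits.

```matlab
spmd
    % Emulate a task of variable duration (stand-in for the real work).
    t = tic;
    pause(rand);
    myTime = toc(t);

    % Collective call: each worker contributes one scalar and each worker
    % gets back a 1-by-numlabs row vector holding every worker's value.
    allTimes = gcat(myTime);

    % Same data and same rule on every worker, so the choice is consistent.
    [~, winner] = min(allTimes);
    committed = (labindex == winner);
    if committed
        % Commit work here.
    end
end
```

One design consequence worth noting: gcat is a collective operation, so every worker waits at that call until all workers have reached it. The decision is therefore made after all workers finish, rather than letting a fast worker interrupt the slower ones.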