MATLAB Parallel Server worker-to-worker communication port

However, when I tried to run the validation test from my client node with 2 or more workers, I got an error from a worker node saying:
Error using parallel.internal.getJavaFutureResult (line 33)
MatlabPoolPeerInstance{fLabIndex=1, fNumberOfLabs=4,
fUuid=277db6ba-1f54-4ca9-8045-eb58a0f82d08} was unable
to connect to [an ip][my-worker-ip]:35721 most probably because
[an ip][my-worker-ip]:35721 refused the connection: Connection refused
I am confused and don't know how to resolve this issue. I opened all the ports required:
BASEPORT+1000 to BASEPORT+1000+2*nW (only to other cluster machines)
But it seems to be trying to reach a random port above 30000. I believe head-worker, client-worker, and client-head communication all work, but worker-worker communication is blocked. Is there a specific port for worker-to-worker communication? The port looks random, and the error message shows a different one every time.
Thanks in advance!
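To make the mismatch concrete, here is a minimal sketch that computes the range quoted in the question (BASEPORT+1000 to BASEPORT+1000+2*nW) and checks whether the ports from the error messages fall inside it. The function names are hypothetical, and the BASEPORT value of 27350 is an assumption; substitute the value from your own mjs_def file.

```python
def documented_worker_ports(base_port: int, num_workers: int) -> range:
    """Ports BASEPORT+1000 .. BASEPORT+1000+2*nW (inclusive), as quoted in the question."""
    start = base_port + 1000
    return range(start, start + 2 * num_workers + 1)


def is_documented_port(port: int, base_port: int, num_workers: int) -> bool:
    """True if `port` lies inside the range quoted in the question."""
    return port in documented_worker_ports(base_port, num_workers)


if __name__ == "__main__":
    base_port = 27350  # ASSUMED BASEPORT; take the real value from your mjs_def file
    num_workers = 4    # fNumberOfLabs=4 in the first error message
    ports = documented_worker_ports(base_port, num_workers)
    print(f"range from the question: {ports.start}-{ports[-1]}")
    for observed in (35721, 44107):  # ports seen in the error messages
        inside = is_documented_port(observed, base_port, num_workers)
        print(f"port {observed} inside that range: {inside}")
```

Under these assumptions, neither observed port is in the range, which is consistent with the workers using ports outside the ones that were opened.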

Accepted Answer

Ziwei Zhao on 24 Nov 2020
I've found a solution to this with the help of the MATLAB Support team. Thank you!
In my setup, I need to expose the head and worker nodes externally so that my client UI, which is outside the VPC, can connect to them. However, MATLAB Parallel Server assumes that communication between machines within the cluster is unrestricted, which implies that all ports need to be open for worker-worker communication.
The solution is to add
in the mjs_def file, with the hostname in that file set to the external IP. With this change, worker-worker communication no longer goes through the external Load Balancer service but stays local. (In my setup, all worker workloads are in the same GKE cluster, so they can talk to each other internally.)

More Answers (1)

Raymond Norris on 20 Nov 2020
A few questions
  • Which version of MATLAB are you running?
  • Is MATLAB running on your local machine and MATLAB Parallel Server on GKE? Or is MATLAB also on GCP?
  • On the MATLAB client, what does the following return
>> pctconfig
  • Other than the last stage failing, do all the other stages pass?
There's an environment variable, in $matlabroot/toolbox/parallel/bin/ (on the cluster), called ALL_SERVER_SOCKETS_IN_CLUSTER, which might need to be uncommented and set to "true" (and then restart mjs from scratch).
Ziwei Zhao on 20 Nov 2020
I uncommented that line, but it doesn't really help since the value defaults to "true" in R2020b. Maybe the log will help more in this case (it comes from my worker1; I currently have two workers running):
2020 11 20 04:46:56.818 UTC | 0 | Caught an error during nEvaluateTask:
Error using parallel.internal.getJavaFutureResult (line 33)
MatlabPoolPeerInstance{fLabIndex=1, fNumberOfLabs=1, fUuid=f2234a87-aee9-4160-b303-fed7fc87b754} was unable to connect to MY_WORKER2_IP:44107 most probably because MY_WORKER2_IP:44107 refused the connection: Connection refused
Error in parallel.internal.getJavaFutureInterruptibly (line 35)
[done, value] = parallel.internal.getJavaFutureResult(...
Error in parallel.internal.pool.AbstractWorker/connectToClient (line 140)
obj.Session = parallel.internal.getJavaFutureInterruptibly(sessionFuture);
Error in parallel.internal.pool.PoolWorker/start (line 35)
obj.connectToClient( connInfo );
Error in parallel.internal.apishared.PoolFactory.instantiate (line 41)
Error in parallel.CommunicatingJob/pInstantiatePool (line 175)
[job.IsPoolTask, job.Pool] = parallel.internal.apishared.PoolFactory.instantiate( job.pIsInteractivePool(), ...
Error in dctEvaluateTask>iEvaluateTask/nEvaluateTask (line 299)
job.pInstantiatePool(task, fcn, nOut, args);
Error in dctEvaluateTask>iEvaluateTask (line 166)
Error in dctEvaluateTask (line 78)
[resultsFcn, taskPostFcn, taskEvaluatedOK] = iEvaluateTask(job, task, runprop);
Error in distcomp_evaluate_task>iDoTask (line 153)
dctEvaluateTask(postFcns, finishFcn);
Error in distcomp_evaluate_task (line 75)
iDoTask(handlers, attempt, postFcns);
Error in distcomp_evaluate_task_mvm (line 40)
distcomp_evaluate_task(outputWriterStack, workerProxy, performJobInit, justStarted, serializedCredentials, taskExecutionInfo, postFcns)
It seems to be trying to connect to port 44107, which is refused because I never opened that port in my container (it is not mentioned among the required ports above). Could you give me some insight into why it is trying to contact this specific port? :)
Thank you!
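A quick way to reproduce the "Connection refused" condition outside MATLAB is to attempt a plain TCP connection from one worker container to the other. The sketch below uses only the Python standard library; `can_connect` is a hypothetical helper name, and the host and port in the usage note are placeholders taken from the log above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable or unresolvable hosts.
        return False
```

Running something like `can_connect("MY_WORKER2_IP", 44107)` from worker1 while the pool is starting would show whether the port is reachable at the network level. Because the port changes between runs, the check only makes sense against the port reported in the current error message.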
