MATLAB Answers


parpool() stalls on Xeon Phi x200 with >50 workers

Asked by mph
on 18 May 2018
Latest activity Commented on by mph
on 22 May 2018

I am evaluating parpool() on my new Intel Xeon Phi "Knights Landing" 7210. parpool('local',NumWorkers) successfully creates a pool for NumWorkers<51, but it stalls and never completes for any NumWorkers of 51 or more.

My system: 64 physical cores | 256 logical cores | 6x16GB memory | OS = CentOS Linux | MATLAB version R2018a

Attempted solutions: (1) changed the Java heap size between 512MB and 8192MB; (2) set the Java ThreadStackSize via $MATLAB/bin/glnxa64/java.opts (tried -XX:ThreadStackSize=8192 and 16384); (3) called distcomp.feature( 'LocalUseMpiexec', false );

Each worker created by parpool takes about 0.5GB (according to top), such that plenty of system memory is left. Java memory resources also seem not to be depleted.
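As a rough cross-check of that memory estimate, the two vmstat readings from the test output below (85393 MB free before the pool, 65170 MB free with 50 workers connected) can be turned into a per-worker figure. This is only a sketch: the variable names are made up for illustration, the extrapolation to 51 workers assumes every worker costs the same, and vmstat's "free memory" ignores page cache, so the numbers are approximate.

```matlab
% Values copied from the vmstat output in this post (in MB).
freeBeforeMB = 85393;   % free memory before starting the pool
freeAfterMB  = 65170;   % free memory with 50 workers connected
numWorkers   = 50;

% Approximate memory consumed per worker.
perWorkerMB = (freeBeforeMB - freeAfterMB) / numWorkers;

% Linear extrapolation to 51 workers (assumes equal cost per worker).
neededFor51MB   = 51 * perWorkerMB;
headroomFor51MB = freeBeforeMB - neededFor51MB;

fprintf('Per worker: %.0f MB\n', perWorkerMB);
fprintf('51 workers: %.0f MB needed, %.0f MB left\n', neededFor51MB, headroomFor51MB);
```

This works out to roughly 0.4 GB per worker, in line with the 0.5 GB seen in top, and leaves more than 60 GB free at 51 workers, so system memory by itself does not look like the limit.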

Here is a test I ran:

%% parpool() test
distcomp.feature( 'LocalUseMpiexec', false )
JavaRuntimeSettings = java.lang.management.ManagementFactory.getRuntimeMXBean.getInputArguments
[~,freeSystemMemory]=system('vmstat -s -S M | grep "free memory"')
rJavaObj = java.lang.Runtime.getRuntime;
freeMemory = rJavaObj.freeMemory
totalMemory = rJavaObj.totalMemory
maxMemory = rJavaObj.maxMemory
for NumberOfWorkers = [50, 51]
  tic
  pool = parpool('local',NumberOfWorkers)
  TimeElapsed  = toc
  [~,freeSystemMemory]=system('vmstat -s -S M | grep "free memory"')
  rJavaObj = java.lang.Runtime.getRuntime;
  freeMemory = rJavaObj.freeMemory
  totalMemory = rJavaObj.totalMemory
  maxMemory = rJavaObj.maxMemory
  delete(pool)
end

And here is the output I get:

ans =
logical
 0
JavaRuntimeSettings =
[-Xms64m, -XX:NewRatio=3, -Xmx2048m, -XX:MaxDirectMemorySize=2147400000, -XX:+AllowUserSignalHandlers, -Xrs, -XX:ThreadStackSize=16384, -Djava.library.path=/usr/local/MATLAB/R2018a/bin/glnxa64:/usr/local/MATLAB/R2018a/sys/jxbrowser/glnxa64/lib, vfprintf, -XX:ErrorFile=/home/mph/hs_error_pid38489.log, abort, -Duser.language=en, -Duser.country=US, -Dfile.encoding=UTF-8, -XX:ParallelGCThreads=6]
freeSystemMemory =
  '        85393 M free memory
   '
freeMemory =
 313054528
totalMemory =
 458752000
maxMemory =
 1.9687e+09
Starting parallel pool (parpool) using the 'local' profile ...
connected to 50 workers.
pool = 
Pool with properties: 
            Connected: true
           NumWorkers: 50
              Cluster: local
        AttachedFiles: {}
    AutoAddClientPath: true
          IdleTimeout: 3 minutes (3 minutes remaining)
          SpmdEnabled: true
TimeElapsed =
   69.1710
freeSystemMemory =
    '        65170 M free memory
     '
freeMemory =
   351541184
totalMemory =
   448266240
maxMemory =
   1.9687e+09
Parallel pool using the 'local' profile is shutting down.
Starting parallel pool (parpool) using the 'local' profile ...
connected to 51 workers.

At that point it stalls and I never get the prompt back. Using the top command in the Linux terminal I can see plenty of idle MATLAB workers.

When I terminate the process (Ctrl+C) within MATLAB I get the following:

    Operation terminated by user during parallel.internal.queue.JavaBackedFuture/waitScalar (line 211)
In parallel.Future>@(o)waitScalar(o,predicate,waitGranularity,deadline)
In parallel.Future/wait (line 292)
            ret = all(arrayfun(@(o) waitScalar(o, predicate, waitGranularity, deadline), ...
In parallel.Future/fetchOutputsImpl (line 574)
            wait(F);
In parallel.Future/fetchOutputs (line 341)
                varargout = fetchOutputsImpl(F(:), nargout, varargin{:});
In parallel.Pool>iPostLaunchSetup (line 674)
    mapping = fetchOutputs(parfevalOnAll(pool, @iGetMachineToWorkerMappingAndUnfreezePaths, 1, ...
In parallel.Pool.hBuildPool (line 588)
            iPostLaunchSetup(aPool, client.ParallelJob.AdditionalPaths);
In parallel.internal.pool.doParpool (line 18)
    pool = parallel.Pool.hBuildPool(constructorArgs{:});
In parpool (line 98)
    pool = parallel.internal.pool.doParpool(varargin{:});
In partictoc (line 12)
    pool = parpool('local',NumberOfWorkers) 

So, what are these workers waiting for, and why? How can I make them do work?


1 Answer

Answer by Sangeetha Jayaprakash on 21 May 2018

Hi,

If you are referring to Xeon Phi host processors (as introduced with the Knights Landing architecture), they are compatible with the Parallel Computing Toolbox, like any other x86_64 processor with multiple cores. Xeon Phi coprocessors, on the other hand, are not currently supported.

  2 Comments

Thanks for the response. However, I have the socketed version, not the coprocessor. As you say, it should just work like any other x86_64 CPU, but it doesn't. What other information can I provide to help troubleshoot?

[root@230-83 mph]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                256
On-line CPU(s) list:   0-255
Thread(s) per core:    4
Core(s) per socket:    64
Socket(s):             1
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 87
Model name:            Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Stepping:              1
CPU MHz:               1297.968
CPU max MHz:           1500.0000
CPU min MHz:           1000.0000
BogoMIPS:              2600.09
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
NUMA node0 CPU(s):     0-255
NUMA node1 CPU(s):     
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait epb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm arat pln pts

And here is the information on my MATLAB installation:

>> ver('distcomp')
-----------------------------------------------------------------------------------------------------
MATLAB Version: 9.4.0.813654 (R2018a)
MATLAB License Number: 648372
Operating System: Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64
Java Version: Java 1.8.0_144-b01 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
-----------------------------------------------------------------------------------------------------
Parallel Computing Toolbox                            Version 6.12        (R2018a)
