parfor loop errors on AMD cores limits

Hello,
I am trying to run a simple parfor script on nodes in our cluster. The code works fine until I try to use more than 46 CPUs (workers) at once on one server. Some of our latest nodes have 128 AMD cores. I can run up to 56 cores on our Intel CPU servers (nodes), but on any AMD node I get errors (Java runtime and others) when using more than 46 cores. It would be great to use all 128 cores on these new nodes for our MATLAB code. I have tried increasing memory, and I still get these errors when using more than 46 cores.
I will attach the MATLAB crash dump, code and sbatch files.
My sbatch file (I have tried many, many different parameters) -
#!/bin/bash
#SBATCH -J pfor_matlab
#SBATCH -o pfor".%j".out
#SBATCH -e pfor".%j".err
#SBATCH -t 45:00
#SBATCH -N 1
#SBATCH -p normal
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=48
module load matlab
hostname -s
env | egrep SLURM
matlab -nosplash -nodesktop -r "pfor"
The sbatch produces this output in the SLURM .err file-
Error using parpool (line 145)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'local' in the Cluster Profile Manager.
Error in pfor (line 5)
parpool('local', str2num(getenv('SLURM_CPUS_PER_TASK')))
Caused by:
Error using parallel.internal.pool.InteractiveClient>iThrowWithCause (line
670)
Failed to initialize the interactive session.
Error using
parallel.internal.pool.InteractiveClient>iThrowIfBadParallelJobStatus
(line 781)
The interactive communicating job failed with no message
Thank you for any pointers!
Mark
Mark PIERCY on 9 Feb 2021
Thanks Walter, I just submitted one.
Best,
Mark

Accepted Answer

Mark PIERCY on 2 Mar 2021
Edited: Mark PIERCY on 2 Mar 2021
This happens on both AMD and Intel nodes in our cluster. On our 128-core nodes, parpool(128) works once the nproc limit is raised to 32,768.
Our systems architect figured this out. It turns out that MATLAB was segfaulting when it hit the default nproc limit (the maximum number of user processes), which is set to 4,096.
The limit comes from /etc/security/limits.d/20-nproc.conf, provided by the PAM RPM on CentOS 7, which limits every user to 4,096 processes at once. That is not enough for MATLAB with a large parallel pool.
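For anyone who wants to check whether they are hitting the same limit, here is a rough sketch. It assumes a Linux node with bash and procps `ps`; the helper name `nproc_ok` and the 32,768 threshold are just illustrative (the threshold is the value that worked for us). Note that on Linux the nproc limit counts threads as well as processes, which is why a large pool can run into it.

```shell
#!/bin/bash
# nproc_ok LIMIT NEEDED
# Hypothetical helper: prints "ok" if LIMIT (a number or "unlimited")
# is at least NEEDED, otherwise prints "too low".
nproc_ok() {
    local limit="$1" needed="$2"
    if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$needed" ]; then
        echo "ok"
    else
        echo "too low"
    fi
}

# Soft limit on processes/threads for the current user in this shell.
soft_limit="$(ulimit -u)"

# Threads currently owned by this user (one output line per thread).
in_use="$(ps -L -u "$(id -un)" --no-headers 2>/dev/null | wc -l)"

echo "soft nproc limit: $soft_limit, threads in use: $in_use"
echo "enough for a 32768-thread budget? $(nproc_ok "$soft_limit" 32768)"
```

Running this inside the batch job (before the `matlab` line) shows the limit the job actually sees, which may differ from the limit in a login shell.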
Details:
>> parpool('local', 46)
Starting parallel pool (parpool) using the 'local' profile ...
*** Error in `/share/software/user/restricted/matlab/R2020a/bin/glnxa64/MATLAB': double free or corruption (!prev): 0x00007f4e4027d090 ***
*** Error in `/share/software/user/restricted/matlab/R2020a/bin/glnxa64/MATLAB*** Error in `/share/software/user/restricted/matlab/R2020a/bin/glnxa64/MATLAB': free(): corrupted unsorted chunks: 0x00007f4e401e10d0 ***
A bad case of segmentation fault:
[7191121.585896] MATLAB[120580]: segfault at 118c0f20 ip 00007fa1cd076b8d sp 00007fa1973fbc60 error 4 in libc-2.17.so[7fa1cd03d000+1c2000]
[7191121.610055] traps: MATLAB[120453] general protection ip:7fcb5d0b0b8d sp:7fcb273fbc60 error:0 in libc-2.17.so[7fcb5d077000+1c2000]
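If the system-wide file cannot be changed, a possible per-job workaround is to raise the soft limit inside the job script before starting MATLAB. This is only a sketch and was not what we deployed: it assumes the hard nproc limit on the node is higher than the 4,096 soft limit, and `pick_limit` and `target` are names made up for this example.

```shell
#!/bin/bash
# pick_limit HARD TARGET
# Hypothetical helper: prints TARGET if the hard limit allows it,
# otherwise prints the hard limit (the most an unprivileged user can set).
pick_limit() {
    local hard="$1" target="$2"
    if [ "$hard" = "unlimited" ] || [ "$hard" -ge "$target" ]; then
        echo "$target"
    else
        echo "$hard"
    fi
}

target=32768                 # value reported to work for parpool(128) in this thread
soft="$(ulimit -S -u)"       # current soft limit
hard="$(ulimit -H -u)"       # ceiling for the soft limit

# Only raise the soft limit; leave it alone if it is already high enough.
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$target" ]; then
    ulimit -S -u "$(pick_limit "$hard" "$target")"
fi
echo "soft nproc limit is now $(ulimit -S -u)"

# Then launch MATLAB as in the original sbatch script, e.g.:
# matlab -nosplash -nodesktop -r "pfor"
```

If the hard limit is also 4,096, this changes nothing and the limit has to be raised by an administrator.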

Release

R2020a