Why am I unable to validate my Slurm configuration in the Parallel Computing Toolbox?

I have MATLAB Parallel Server set up on a Linux cluster running Slurm. When I attempt to validate the cluster configuration, validation fails.

Accepted Answer

MathWorks Support Team on 7 Apr 2023
Edited: MathWorks Support Team on 14 Apr 2023
Several issues can prevent validation of the cluster. Run the tests below to make sure that your configuration is set up properly. If at any point you receive an error message, you can submit a request to Installation support using the link at the bottom of the page. When submitting a request, be sure to include the following:
  • Your license number
  • The release of MATLAB on the client and the cluster
  • The output of your validation (click details to get the full information)
  • The results of the tests below
1) Test the licensing of MATLAB Parallel Server
The first step is to ensure that the licensing for MATLAB Parallel Server works on your cluster. This also tests whether MATLAB crashes on startup on your cluster. To test this, go to one of the cluster nodes, open a Terminal window, and run the following commands:
cd $MATLAB/bin (where $MATLAB is the installation folder for MATLAB on the cluster)
./matlab -dmlworker -nodisplay -logfile /var/tmp/output.txt -r "ver;exit"
This generates an output.txt file in /var/tmp that contains the output of the ver command on the cluster. If the log file contains a network license manager error, licensing is the cause of the validation failure. In that case, check MATLAB Answers for the license manager error number and take the appropriate action to resolve the license error before proceeding.
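As a rough sketch, the log check described above could be scripted as follows. The function name check_license_log is hypothetical, and the search phrase "license manager error" is an assumption about how the error is worded in the log; adjust it if your log uses different wording.

```shell
#!/bin/sh
# Hypothetical helper: scan a MATLAB log file for license manager errors.
# Assumption: such errors contain the phrase "License Manager Error"
# (matched case-insensitively here).
check_license_log() {
    log_file="$1"
    # The log is missing if MATLAB never started or could not write to it.
    if [ ! -r "$log_file" ]; then
        echo "missing"
        return 2
    fi
    if grep -i -q "license manager error" "$log_file"; then
        echo "license-error"
        return 1
    fi
    echo "ok"
    return 0
}
```

After running the matlab command above, you could call check_license_log /var/tmp/output.txt. A result of "missing" suggests that MATLAB did not start or could not write the log file, which is worth reporting in a support request as well.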
2) Check the releases of MATLAB on the cluster and the client where you validated
If the log file contains the output of the "ver" command, check the release (R20XXx) of each product in the list. All products should be at the same release, and that release should match the one installed on the client where you ran the validation. To check the release on the client, run the ver command in the MATLAB Command Window. If the release of MATLAB and Parallel Computing Toolbox on the client does not match the release of MATLAB and MATLAB Parallel Server on the cluster, you will not be able to use this configuration until the installations are at the same release.
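If the product list is long, a small script can pull the release tags out of the log for you. This is only a sketch: the function names list_releases and releases_match are hypothetical, the pattern assumes release tags of the form R20XXa or R20XXb, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Hypothetical helpers: collect the distinct release tags (for example R2023a)
# that appear in a saved "ver" log, and check that there is exactly one.

# Print each distinct release tag found in the file, one per line.
list_releases() {
    grep -o 'R20[0-9][0-9][ab]' "$1" | sort -u
}

# Succeed only if the file mentions exactly one distinct release.
releases_match() {
    count=$(list_releases "$1" | wc -l | tr -d ' ')
    [ "$count" -eq 1 ]
}
```

For example, list_releases /var/tmp/output.txt prints the releases found on the cluster. If more than one tag is printed, the products on the cluster are not all at the same release. Compare the printed tag with the ver output on the client by hand.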
3) Check to make sure that your configuration meets the scheduler requirements
In order to use MATLAB Parallel Server with Slurm, there are some additional requirements in the setup:
  • The scheduler binaries do not need to be accessible from the MATLAB client that runs Parallel Computing Toolbox. If the client does not have the binaries, you can submit jobs by utilizing the nonshared configuration on the MATLAB client or by remotely accessing one of the cluster nodes to run the MATLAB client.
  • Your cluster should be completely homogeneous; Slurm currently supports only Linux. Mixing different platforms or distributions is not recommended, especially for parallel computation.
  • This configuration requires that the data for the jobs be stored on a shared file space between the clients and the cluster nodes. When creating the configuration, set the "JobStorageLocation" property to be a path that is accessible to all computers.
  • The MATLAB client machine does not have to be the same operating system as the cluster.
  • Passwordless SSH access between compute nodes must be set up in order to submit jobs.
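Two of the requirements above can be spot-checked from a terminal. The sketch below uses hypothetical function names, and it assumes that sbatch, squeue, and scancel are the Slurm commands that matter for job submission on your setup. The SSH check uses the standard OpenSSH options BatchMode and ConnectTimeout so that it fails instead of prompting for a password.

```shell
#!/bin/sh
# Hypothetical helpers for spot-checking the scheduler requirements.

# Succeed if the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Report which Slurm commands are visible on this machine.
# Assumption: sbatch, squeue and scancel are the relevant binaries.
check_slurm_binaries() {
    missing=0
    for tool in sbatch squeue scancel; do
        if have_command "$tool"; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Succeed only if SSH to the given host works without any password prompt.
check_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}
```

Run check_slurm_binaries on the machine that will submit jobs. If binaries are reported missing on the client, that is not necessarily an error, since the nonshared configuration described above covers that case. Run check_passwordless_ssh with the name of another compute node while logged in to a compute node.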
For more information, consult the Slurm FAQ here:
4) Check to ensure you have correctly configured the client configuration
In your MATLAB client, go to the Parallel menu and select Manage Cluster Profiles. Click your Generic Profile for Slurm configuration, and then click Edit. You must set appropriate values for ClusterMatlabRoot (the directory where MATLAB is installed on the cluster), JobStorageLocation (where the job data will be stored; it must be accessible at the same path from all computers), and HasSharedFilesystem (should be set to True).
For more information on filling out the Generic Profile, reference the following guide:
If you have confirmed all of the settings above, do all stages fail during validation, or only the parallel and matlabpool/parpool stages? If the Distributed Job stage passes, the validation may be reporting false errors. To confirm, you can manually validate your cluster by following the instructions in the following article:
If the manual tests passed, your configuration is working and you should be able to submit jobs.
If you still have issues, contact support.
NOTE: Starting in R2019a the following name changes occurred:
  • MATLAB Distributed Computing Server was renamed to MATLAB Parallel Server 
  • mdce_def was renamed to mjs_def
  • mdce binary was renamed to mjs
