Integration Scripts for Generic Schedulers

The generic scheduler interface provides complete flexibility to configure the interaction of the MATLAB® client, MATLAB workers, and a third-party scheduler. The integration scripts define how MATLAB interacts with your setup.

The following table lists the supported integration script functions and the stage at which they are evaluated:

File Name                    Stage
independentSubmitFcn.m       Submitting an independent job
communicatingSubmitFcn.m     Submitting a communicating job
getJobStateFcn.m             Querying the state of a job
cancelJobFcn.m               Canceling a job
cancelTaskFcn.m              Canceling a task
deleteJobFcn.m               Deleting a job
deleteTaskFcn.m              Deleting a task
postConstructFcn.m           After creating a parallel.cluster.Generic instance

These integration scripts are evaluated only if they have the expected file name and are located in the folder specified by the IntegrationScriptsLocation property of the cluster. For more information about how to configure a generic cluster profile, see Configure Using the Generic Scheduler Interface (MATLAB Parallel Server).

Note

The independentSubmitFcn.m must exist to submit an independent job, and the communicatingSubmitFcn.m must exist to submit a communicating job.

Sample Integration Scripts

To support usage of the generic scheduler interface, integration scripts are available for the following third-party schedulers:

Each installer provides scripts for three possible submission modes:

  • Shared – The client can submit directly to the scheduler, and the client and the cluster nodes (or machines) have a shared file system.

  • Remote – The client and cluster nodes have a shared file system, but the client machine cannot submit directly to the scheduler, such as when the client utilities of the scheduler are not installed. This mode uses the ssh protocol to submit commands to the scheduler using a remote host.

  • Nonshared – The client and cluster nodes do not have a shared file system. This mode uses the ssh protocol to submit commands to the scheduler using a remote host, and it uses the sftp protocol to copy job and task files to the cluster file system.

Each submission mode has its own subfolder within the installation folder. This subfolder contains a README file that provides specific instructions on how to use the scripts. Before using the scripts, decide which submission mode describes your network setup.

To run the installer, download the appropriate support package for your scheduler, and open it in your MATLAB client. The installer includes a wizard to guide you through creating a cluster profile for your cluster configuration.

If your scheduler or cluster configuration is not supported by one of the support packages, it is recommended that you modify the scripts of one of these packages. For more information on how to write a set of integration scripts for generic schedulers, see Writing Custom Integration Scripts.

Wrapper Scripts

The sample integration scripts use wrapper scripts to simplify the implementation of independentSubmitFcn.m and communicatingSubmitFcn.m. These scripts are not required; however, using them is good practice because it makes your code more readable. This table describes these scripts:

File Name                    Description
independentJobWrapper.sh     Used in independentSubmitFcn.m to embed a call to the MATLAB executable with the appropriate arguments. It uses environment variables for the location of the executable and its arguments. For an example of its use, see Sample script for a SLURM scheduler.
communicatingJobWrapper.sh   Used in communicatingSubmitFcn.m to distribute a communicating job in your cluster. This script implements the steps in Submit scheduler job to launch MPI process. For an example of its use, see Sample script for a SLURM scheduler.

Writing Custom Integration Scripts

Note

When writing your own integration scripts, it is a good practice to start by modifying one of the sample integration scripts that most closely matches your setup (see Sample Integration Scripts).

independentSubmitFcn

When you submit an independent job to a generic cluster, the independentSubmitFcn.m function executes in the MATLAB client session.

The declaration line of this function must be:

function independentSubmitFcn(cluster,job,environmentProperties)

Each task in a MATLAB independent job corresponds to a single job on your scheduler. The purpose of this function is to submit N jobs to your third-party scheduler, where N is the number of tasks in the independent job. Each of these jobs must:

  1. Set the five environment variables required by the worker MATLAB to identify the individual task to run. For more information, see Configure the worker environment.

  2. Call the appropriate MATLAB executable to start the MATLAB worker and run the task. For more information, see Submit scheduler jobs to run MATLAB workers.

Configure the worker environment.  This table identifies the five environment variables and values that must be set on the worker MATLAB to run an individual task:

Environment Variable Name    Environment Variable Value
MDCE_DECODE_FUNCTION         'parallel.cluster.generic.independentDecodeFcn'
MDCE_STORAGE_CONSTRUCTOR     environmentProperties.StorageConstructor
MDCE_STORAGE_LOCATION        Depends on your file system:

  • If you have a shared file system between the client and cluster nodes, use environmentProperties.StorageLocation.

  • If you do not have a shared file system between the client and cluster nodes, select a folder visible to all cluster nodes. For instructions on copying job and task files between client and cluster nodes, see Submitting without a Shared File System.

MDCE_JOB_LOCATION            environmentProperties.JobLocation
MDCE_TASK_LOCATION           environmentProperties.TaskLocations{n} for the nth task

Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.
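
As a minimal sketch of the second case, the following lines build an sbatch command that names each variable in the --export option, so that SLURM forwards the value the variable has in the submitting environment. The variable list combines the table above with the two variables read by independentJobWrapper.sh. The command string is illustrative only, and the exact behavior of --export can differ between SLURM versions, so check the sbatch documentation for your installation.

```matlab
% Variables the worker needs (see the table above), plus the two variables
% read by independentJobWrapper.sh.
variableNames = {'MDCE_DECODE_FUNCTION', 'MDCE_STORAGE_CONSTRUCTOR', ...
    'MDCE_STORAGE_LOCATION', 'MDCE_JOB_LOCATION', 'MDCE_TASK_LOCATION', ...
    'MDCE_MATLAB_EXE', 'MDCE_MATLAB_ARGS'};

% Naming a variable without a value asks sbatch to forward the value it
% has in the submitting environment (set earlier with setenv).
exportOption = ['--export=' strjoin(variableNames, ',')];

% Illustrative submit command; adapt the options to your scheduler.
commandToRun = sprintf('sbatch --ntasks=1 %s independentJobWrapper.sh', exportOption);
disp(commandToRun);
```

Other schedulers offer comparable options under different names, so treat the option shown here as a placeholder for whatever your scheduler provides.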

Submit scheduler jobs to run MATLAB workers.  Once the five required parameters for a given job and task are defined on a worker, the task is run by calling the MATLAB executable with suitable arguments. The MATLAB executable to call is defined in environmentProperties.MatlabExecutable. The arguments to pass are defined in environmentProperties.MatlabArguments.

Note

If you cannot submit directly to your scheduler from the client machine, see Submitting from a Remote Host for instructions on how to submit using ssh.

Sample script for a SLURM scheduler.  This script shows a basic submit function for a SLURM scheduler with a shared file system. For a more complete example, see the sample support scripts in Sample Integration Scripts.

function independentSubmitFcn(cluster,job,environmentProperties)
    % Specify the required environment variables.
    setenv('MDCE_DECODE_FUNCTION', 'parallel.cluster.generic.independentDecodeFcn');
    setenv('MDCE_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
    setenv('MDCE_STORAGE_LOCATION', environmentProperties.StorageLocation);
    setenv('MDCE_JOB_LOCATION', environmentProperties.JobLocation);

    % Specify the MATLAB executable and arguments to run on the worker.
    % These are used in the independentJobWrapper.sh script.
    setenv('MDCE_MATLAB_EXE', environmentProperties.MatlabExecutable);
    setenv('MDCE_MATLAB_ARGS', environmentProperties.MatlabArguments);

    for ii = 1:environmentProperties.NumberOfTasks
        % Specify the environment variable required to identify which task to run.
        setenv('MDCE_TASK_LOCATION', environmentProperties.TaskLocations{ii});
        % Specify the command to submit the job to the SLURM scheduler.
        % SLURM will automatically copy environment variables to workers.
        commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
        [cmdFailed, cmdOut] = system(commandToRun);
        % Stop if the scheduler rejected the submission.
        if cmdFailed
            error('Submission of task %d failed:\n%s', ii, cmdOut);
        end
    end
end
 

The previous example submits a simple shell script, independentJobWrapper.sh, to the scheduler. The independentJobWrapper.sh script embeds the MATLAB executable and arguments using environment variables:

#!/bin/sh
# MDCE_MATLAB_EXE - the MATLAB executable to use
# MDCE_MATLAB_ARGS - the MATLAB args to use
exec "${MDCE_MATLAB_EXE}" ${MDCE_MATLAB_ARGS}

communicatingSubmitFcn

When you submit a communicating job to a generic cluster, the communicatingSubmitFcn.m function executes in the MATLAB client session.

The declaration line of this function must be:

function communicatingSubmitFcn(cluster,job,environmentProperties)

The purpose of this function is to submit a single job to your scheduler. This job must:

  1. Set the four environment variables required by the MATLAB workers to identify the job to run. For more information, see Configure the worker environment.

  2. Call MPI to distribute your job to N MATLAB workers. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job. For more information, see Submit scheduler job to launch MPI process.

Configure the worker environment.  This table identifies the four environment variables and values that must be set on the worker MATLAB to run a task of a communicating job:

Environment Variable Name    Environment Variable Value
MDCE_DECODE_FUNCTION         'parallel.cluster.generic.communicatingDecodeFcn'
MDCE_STORAGE_CONSTRUCTOR     environmentProperties.StorageConstructor
MDCE_STORAGE_LOCATION        Depends on your file system:

  • If you have a shared file system between the client and cluster nodes, use environmentProperties.StorageLocation.

  • If you do not have a shared file system between the client and cluster nodes, select a folder that exists on all cluster nodes. For instructions on copying job and task files between client and cluster nodes, see Submitting without a Shared File System.

MDCE_JOB_LOCATION            environmentProperties.JobLocation

Many schedulers support copying the client environment as part of the submission command. If so, you can set the previous environment variables in the client, so the scheduler can copy them to the worker environment. If not, you must modify your submission command to forward these variables.

Submit scheduler job to launch MPI process.  After you define the four required parameters for a given job, run your job by launching N worker MATLAB processes using mpiexec. mpiexec is software shipped with the Parallel Computing Toolbox™ that implements the Message Passing Interface (MPI) standard to allow communication between the worker MATLAB processes. For more information about mpiexec, see the MPICH home page.

To run your job, you must submit a job to your scheduler, which executes the following steps. Note that matlabroot refers to the MATLAB installation location on your worker nodes.

  1. Request N processes from the scheduler. N corresponds to the maximum value specified in the NumWorkersRange property of the MATLAB job.

  2. Call mpiexec to start worker MATLAB processes. The number of worker MATLAB processes to start on each host should match the number of processes allocated by your scheduler. The mpiexec executable is located at matlabroot/bin/mw_mpiexec.

    The mpiexec command automatically forwards environment variables to the launched processes. Therefore, ensure the environment variables listed in Configure the worker environment are set before running mpiexec.

    To learn more about options for mpiexec, see Using the Hydra Process Manager.

Note

For a complete example of the previous steps, see the communicatingJobWrapper.sh script provided with any of the sample integration scripts in Sample Integration Scripts. Use this script as a starting point if you need to write your own script.

Sample script for a SLURM scheduler.  The following script shows a basic submit function for a SLURM scheduler with a shared file system.

The submitted job is contained in a bash script, communicatingJobWrapper.sh. This script implements the relevant steps in Submit scheduler job to launch MPI process for a SLURM scheduler. For a more complete example, see the sample support scripts in Sample Integration Scripts.

function communicatingSubmitFcn(cluster,job,environmentProperties)
    % Specify the four required environment variables.
    setenv('MDCE_DECODE_FUNCTION', 'parallel.cluster.generic.communicatingDecodeFcn');
    setenv('MDCE_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor);
    setenv('MDCE_STORAGE_LOCATION', environmentProperties.StorageLocation);
    setenv('MDCE_JOB_LOCATION', environmentProperties.JobLocation);

    % Specify the MATLAB executable and arguments to run on the worker.
    % Specify the location of the MATLAB install on the cluster nodes.
    % These are used in the communicatingJobWrapper.sh script.
    setenv('MDCE_MATLAB_EXE', environmentProperties.MatlabExecutable);
    setenv('MDCE_MATLAB_ARGS', environmentProperties.MatlabArguments);
    setenv('MDCE_CMR', cluster.ClusterMatlabRoot);

    numberOfTasks = environmentProperties.NumberOfTasks;

    % Specify the command to submit a job to the SLURM scheduler which
    % requests as many processes as tasks in the job.
    % SLURM will automatically copy environment variables to workers.
    commandToRun = sprintf('sbatch --ntasks=%d communicatingJobWrapper.sh', numberOfTasks);
    [cmdFailed, cmdOut] = system(commandToRun);
    % Stop if the scheduler rejected the submission.
    if cmdFailed
        error('Submission of the communicating job failed:\n%s', cmdOut);
    end
end

getJobStateFcn

When you query the state of a job created with a generic cluster, the getJobStateFcn.m function executes in the MATLAB client session. The declaration line of this function must be:

function state = getJobStateFcn(cluster,job,state)

When using a third-party scheduler, the scheduler might have more up-to-date information about your jobs than the toolbox can obtain from the local job storage location. This is especially likely with a nonshared file system, where the remote file system can be slow to propagate large data files back to your local data location.

To retrieve that information from the scheduler, add a function called getJobStateFcn.m to the IntegrationScriptsLocation of your cluster.

The state passed into this function is the state derived from the local job storage. The body of this function can then query the scheduler to determine a more accurate state for the job and return it in place of the stored state. The function you write for this purpose must return a valid value for the state of a job object. Allowed values are 'pending', 'queued', 'running', 'finished', or 'failed'.

For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
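
A minimal sketch of the state-mapping logic that such a function might contain, assuming a SLURM scheduler. It assumes that the scheduler job IDs were stored at submission time and that squeue was asked to print one state name per job. The squeue command shown in the comment, the sample output, and the mapping rules are illustrative assumptions, not the toolbox's own implementation.

```matlab
% Inside getJobStateFcn(cluster,job,state), the scheduler might be queried
% with a command such as (the job IDs here are made up):
%   [cmdFailed, cmdOut] = system('squeue --noheader --format=%T --jobs=101,102');
% Sample values stand in for the query result and the stored state.
cmdOut = sprintf('RUNNING\nPENDING\n');
state = 'queued';

% One SLURM state name per scheduler job still known to squeue.
schedulerStates = regexp(strtrim(cmdOut), '\s+', 'split');

if any(strcmp(schedulerStates, 'FAILED'))
    % Any failed scheduler job marks the whole MATLAB job as failed.
    state = 'failed';
elseif any(strcmp(schedulerStates, 'RUNNING'))
    state = 'running';
elseif all(strcmp(schedulerStates, 'PENDING'))
    state = 'queued';
end
% Otherwise, keep the state derived from the local job storage.
disp(state);
```

Falling back to the incoming state when the scheduler reports nothing recognizable keeps the function safe for jobs that have already left the scheduler queue.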

cancelJobFcn

When you cancel a job created with a generic cluster, the cancelJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:

function OK = cancelJobFcn(cluster,job)

When you cancel a job created using the generic scheduler interface, by default this action affects only the job data in storage. To cancel the corresponding jobs on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called cancelJobFcn.m to the IntegrationScriptsLocation of your cluster.

The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the queue. The function must return a logical scalar indicating the success or failure of canceling the jobs on the scheduler.

For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.

cancelTaskFcn

When you cancel a task created with a generic cluster, the cancelTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:

function OK = cancelTaskFcn(cluster,task)

When you cancel a task created using the generic scheduler interface, by default this action affects only the task data in storage. To cancel the corresponding job on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called cancelTaskFcn.m to the IntegrationScriptsLocation of your cluster.

The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue. The function must return a logical scalar indicating the success or failure of canceling the job on the scheduler.

For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
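
A minimal sketch of how the body of cancelTaskFcn.m might pick the scheduler job for a single task, assuming a SLURM scheduler. It assumes that the submit function stored a cell array named ClusterJobIDs, ordered by task, as in Save Job Scheduler Data, and that task IDs run from 1 to N without gaps. The sample data and task ID are placeholders for the values that cluster.getJobClusterData(task.Parent) and task.ID would supply.

```matlab
% Placeholders for values available inside cancelTaskFcn(cluster,task):
%   data   = cluster.getJobClusterData(task.Parent);
%   taskID = task.ID;
data = struct('ClusterJobIDs', {{'101'; '102'; '103'}});
taskID = 2;

% Assumes ClusterJobIDs is ordered by task and task IDs have no gaps.
schedulerJobID = data.ClusterJobIDs{taskID};

% Build the SLURM command that cancels only this task's scheduler job.
commandToRun = sprintf('scancel ''%s''', schedulerJobID);
disp(commandToRun);

% In the real function, run the command and report the outcome:
%   [cmdFailed, cmdOut] = system(commandToRun);
%   OK = ~cmdFailed;
```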

deleteJobFcn

When you delete a job created with a generic cluster, the deleteJobFcn.m function executes in the MATLAB client session. The declaration line of this function must be:

function deleteJobFcn(cluster,job)

When you delete a job created using the generic scheduler interface, by default this action affects only the job data in storage. To remove the corresponding jobs on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called deleteJobFcn.m to the IntegrationScriptsLocation of your cluster.

The body of this function can then send a command to the scheduler, for example, to remove the corresponding jobs from the scheduler queue.

For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.
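
As a rough sketch for a SLURM scheduler, the body of deleteJobFcn.m might remove every scheduler job that belongs to the MATLAB job with a single scancel call. The field name ClusterJobIDs and the sample IDs are assumptions that follow the storage pattern in Save Job Scheduler Data. The guard covers jobs that were never submitted and therefore have no stored scheduler data.

```matlab
% Placeholder for: data = cluster.getJobClusterData(job);
data = struct('ClusterJobIDs', {{'101'; '102'; '103'}});

commandToRun = '';
if ~isempty(data) && isfield(data, 'ClusterJobIDs') && ~isempty(data.ClusterJobIDs)
    % scancel accepts several job IDs in one call.
    jobIDs = reshape(data.ClusterJobIDs, 1, []);
    commandToRun = ['scancel ' strjoin(jobIDs, ' ')];
end
disp(commandToRun);

% In the real function, run the command only when it is nonempty:
%   [cmdFailed, cmdOut] = system(commandToRun);
```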

deleteTaskFcn

When you delete a task created with a generic cluster, the deleteTaskFcn.m function executes in the MATLAB client session. The declaration line of this function must be:

function deleteTaskFcn(cluster,task)

When you delete a task created using the generic scheduler interface, by default this action affects only the task data in storage. To remove the corresponding job on your scheduler, you must tell the scheduler what to do and when to do it. To achieve this, add a function called deleteTaskFcn.m to the IntegrationScriptsLocation of your cluster.

The body of this function can then send a command to the scheduler, for example, to remove the corresponding job from the scheduler queue.

For instructions on pairing MATLAB tasks with their corresponding scheduler job ID, see Managing Jobs with Generic Scheduler.

postConstructFcn

After you create an instance of your cluster in MATLAB, the postConstructFcn.m function executes in the MATLAB client session. For example, the following line of code creates an instance of your cluster and runs the postConstructFcn function associated with the 'myProfile' cluster profile:

c = parcluster('myProfile');

The declaration line of the postConstructFcn function must be:

function postConstructFcn(cluster)

If you need to perform custom configuration of your cluster before its use, add a function called postConstructFcn.m to the IntegrationScriptsLocation of your cluster. The body of this function can contain any extra setup steps you require.

Adding User Customization

If you need to modify the functionality of your integration scripts at run time, then use the AdditionalProperties property of the generic scheduler interface.

As an example, consider the SLURM scheduler. The submit command for SLURM accepts a --nodelist argument that allows you to specify the nodes you want to run on. You can change the value of this argument without having to modify your integration scripts. To add this functionality, include the following code pattern in your independentSubmitFcn.m and communicatingSubmitFcn.m scripts:

% Basic SLURM submit command
submitCommand = 'sbatch';
 
% Check if property is defined
if isprop(cluster.AdditionalProperties, 'NodeList')
    % Add appropriate argument and value to submit string
    submitCommand = [submitCommand ' --nodelist=' cluster.AdditionalProperties.NodeList];
end 

For an example of how to use this coding pattern, see the nonshared submit functions of the example support scripts in Sample Integration Scripts.

Setting AdditionalProperties from the Cluster Profile Manager

With the modification to your scripts in the previous example, you can add an AdditionalProperties entry to your generic cluster profile to specify a list of nodes to use. This provides a method of documenting customization added to your integration scripts for anyone you share the cluster profile with.

To add the NodeList property to your cluster profile:

  1. Start the Cluster Profile Manager from the MATLAB desktop by selecting Parallel > Manage Cluster Profiles.

  2. Select the profile for your generic cluster, and click Edit.

  3. Navigate to the AdditionalProperties table, and click Add.

  4. Enter NodeList as the Name.

  5. Set String as the Type.

  6. Set the Value to the list of nodes.

Setting AdditionalProperties from the MATLAB Command Line

With the modification to your scripts in Adding User Customization, you can edit the list of nodes from the MATLAB command line by setting the appropriate property of the cluster object before submitting a job:

c = parcluster;
c.AdditionalProperties.NodeList = 'gpuNodeName';
j = c.batch('myScript'); 

Display the AdditionalProperties object to see all currently defined properties and their values:

>> c.AdditionalProperties
ans = 
  AdditionalProperties with properties:
                 ClusterHost: 'myClusterHost'
                    NodeList: 'gpuNodeName'
    RemoteJobStorageLocation: '/tmp/jobs'

Managing Jobs with Generic Scheduler

The first requirement for job management is to identify the jobs on the scheduler corresponding to a MATLAB job object. When you submit a job to the scheduler, the command that does the submission in your submit function can return some data about the job from the scheduler. This data typically includes a job ID. By storing that scheduler job ID with the MATLAB job object, you can later refer to the scheduler job by this job ID when you send management commands to the scheduler. Similarly, you can store a map of MATLAB task IDs to scheduler job IDs to help manage individual tasks. The toolbox function that stores this cluster data is setJobClusterData.

Save Job Scheduler Data

This example shows how to modify the independentSubmitFcn.m function to parse the output of each command submitted to a SLURM scheduler. You can use regular expressions to extract the scheduler job ID for each task and then store it using setJobClusterData.

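% Pattern to extract scheduler job ID from SLURM sbatch output
searchPattern = '.*Submitted batch job ([0-9]+).*';

numberOfTasks = environmentProperties.NumberOfTasks;
jobIDs = cell(numberOfTasks, 1);
for ii = 1:numberOfTasks
    setenv('MDCE_TASK_LOCATION', environmentProperties.TaskLocations{ii});
    commandToRun = 'sbatch --ntasks=1 independentJobWrapper.sh';
    [cmdFailed, cmdOut] = system(commandToRun);
    % regexp returns the token in a one-element cell array
    tokens = regexp(cmdOut, searchPattern, 'tokens', 'once');
    if cmdFailed || isempty(tokens)
        error('Could not read a job ID for task %d from:\n%s', ii, cmdOut);
    end
    % Store the job ID itself as a character vector
    jobIDs{ii} = tokens{1};
end

% Set the job IDs on the job cluster data
cluster.setJobClusterData(job, struct('ClusterJobIDs', {jobIDs}));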

Retrieve Job Scheduler Data

This example modifies the cancelJobFcn.m to cancel the corresponding jobs on the SLURM scheduler. The example uses getJobClusterData to retrieve job scheduler data.

function OK = cancelJobFcn(cluster, job)

% Get the scheduler information for this job
data = cluster.getJobClusterData(job);
jobIDs = data.ClusterJobIDs;

% Report success only if every scheduler job is canceled
OK = true;
for ii = 1:length(jobIDs)
    % Tell the SLURM scheduler to cancel the job
    commandToRun = sprintf('scancel ''%s''', char(jobIDs{ii}));
    [cmdFailed, cmdOut] = system(commandToRun);
    OK = OK && ~cmdFailed;
end

Submitting from a Remote Host

If the MATLAB client is unable to submit directly to your scheduler, use parallel.cluster.RemoteClusterAccess to establish a connection and run commands on a remote host.

This object uses the ssh protocol, and hence requires an ssh daemon service running on the remote host. To establish a connection, you must either provide a user name and password for the remote host, or a valid identity file.

The following code executes a command on a remote host, remoteHostname, as the user, user.

% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccess('remoteHostname', 'user');
% Execute a command on remoteHostname
[cmdFailed, cmdOut] = access.runCommand(commandToRun);

For an example of integration scripts using remote host submission, see the remote folder of the example support scripts in Sample Integration Scripts.

Submitting without a Shared File System

If the MATLAB client does not have a shared file system with the cluster nodes, use parallel.cluster.RemoteClusterAccess to establish a connection and copy job and task files between the client and cluster nodes.

This object uses the ssh protocol, and hence requires an ssh daemon service running on the remote host. To establish a connection, you must either provide a user name and password for the remote host or a valid identity file.

When using nonshared submission, you must specify both a local job storage location to use on the client and a remote job storage location to use on the cluster. The remote job storage location must be available to all nodes of the cluster.

parallel.cluster.RemoteClusterAccess uses file mirroring to continuously synchronize the local job and task files with those on the cluster. When file mirroring first starts, local job and task files are uploaded to the remote job storage location. As the job executes, the file mirroring continuously checks the remote job storage location for new files and updates, and copies the files to the local storage on the client. This procedure ensures the MATLAB client always has an up-to-date view of the jobs and tasks executing on the scheduler.

This example connects to the remote host, remoteHostname, as the user, user, and establishes /remote/storage as the remote cluster storage location to synchronize with. It then starts file mirroring for a job, copying the local files of the job to /remote/storage on the cluster, and then syncing any changes back to the local machine.

% This will prompt for the password of user
access = parallel.cluster.RemoteClusterAccess.getConnectedAccessWithMirror('remoteHostname', '/remote/storage', 'user');
% Start file mirroring for a job
access.startMirrorForJob(job); 

For an example of integration scripts without a shared file system, see the nonshared folder of the example support scripts in Sample Integration Scripts.
