Offload Deep Learning Experiments as Batch Jobs to a Cluster
By default, Experiment Manager runs your experiments interactively, so you can monitor the progress of each trial by inspecting the results table and the training plot. However, running an experiment interactively limits your access to MATLAB® functionality. For example, when an experiment is running, you cannot close the project that contains the experiment or run other experiments.
If you have Parallel Computing Toolbox™ and MATLAB Parallel Server™, you can send your experiment as a batch job to a remote cluster. While the experiment is running in the cluster, you can:
Run another experiment interactively or start another batch job using the same experiment, a different experiment in the same project, or an experiment in a different project.
Close the Experiment Manager app and continue using MATLAB.
Close your MATLAB session.
If you only have Parallel Computing Toolbox, you can use a local cluster profile to develop and test your experiments on your client machine instead of running them on a network cluster. However, if you close your MATLAB session, any batch jobs that use the local cluster profile stop immediately.
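To check which cluster profile MATLAB uses by default before you send a batch job, a minimal sketch such as the following can help. It assumes that Parallel Computing Toolbox is installed and that the default profile is the one you intend to use. The name of the local profile varies by release (for example, Processes in recent releases and local in older ones).

```matlab
% Minimal sketch: inspect the default cluster profile.
c = parcluster;                          % cluster object for the default profile
fprintf("Profile: %s\n", c.Profile);     % name of the profile
fprintf("Workers: %d\n", c.NumWorkers);  % number of workers the cluster offers

% A local cluster runs its workers on the client machine
isLocal = isa(c, "parallel.cluster.Local");
```

If isLocal is true, batch jobs sent to this cluster stop when you close your MATLAB session.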
Create Batch Job on Cluster
To start a batch job for your experiment:
Load data for your experiment from a location that is accessible to all your parallel workers. For example, store your data outside the project and access the data by using an absolute path. Alternatively, create a datastore object that can access the data on another machine by setting up the AlternateFileSystemRoots property of the datastore. For more information, see Set Up Datastore for Processing on Different Machines or Clusters.
In the Experiment Manager toolstrip, under Execution, use the Mode list to specify an execution mode:
To run one trial of the experiment at a time, select Batch Sequential. Deep learning experiments do not support this execution mode when you set the training option
To run multiple trials at the same time, select Batch Simultaneous. Deep learning experiments do not support this execution mode when you set the training option
"parallel" or when you enable the training option
Use the Cluster list to select a cluster profile to use for your batch job. To create and manage cluster profiles, open the Cluster Profile Manager. For more information, see Discover Clusters and Use Cluster Profiles (Parallel Computing Toolbox).
In the Pool Size field, enter the number of workers for your batch job.
If Mode is Batch Sequential, use this field to configure the number of parallel workers that collaborate on each trial of the experiment. If you set the pool size to 0, the experiment runs on a single worker.
If Mode is Batch Simultaneous, use this field to specify the number of trials that the cluster runs at the same time.
Because Experiment Manager uses an additional worker to run the batch job, the cluster must have at least one more worker available than the number you specify in the Pool Size field. For example, if you specify a pool size of 2, the cluster must have at least three workers available: two workers for the experiment and an additional worker to run the batch job. For more information, see Run a Batch Job with a Parallel Pool (Parallel Computing Toolbox).
Click Run. Experiment Manager uses the batch (Parallel Computing Toolbox) function to run the experiment in the specified cluster.
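For the data-access step at the start of this procedure, the following is a minimal sketch of a datastore that resolves the same files from both the client and the cluster workers. The folder names Z:\DataSet and /nfs-bldg001/DataSet are hypothetical placeholders, and the sketch assumes that the images are organized in one subfolder per class.

```matlab
% Hypothetical root folders for the same shared data set, as seen from a
% Windows client and from Linux cluster workers. Replace with your own.
altRoots = ["Z:\DataSet", "/nfs-bldg001/DataSet"];

% Create the datastore on the client by using an absolute path
imds = imageDatastore("Z:\DataSet", ...
    "IncludeSubfolders", true, ...
    "LabelSource", "foldernames");

% Declare the two root paths as equivalent, so that each machine can
% resolve the file locations on its own file system
imds.AlternateFileSystemRoots = altRoots;
```

With this property set, a worker that cannot find a file under the first root looks for it under the equivalent root instead.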
While the batch job runs your experiment, you can close Experiment Manager and recover the results later.
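Experiment Manager calls batch for you, so you do not need to write this code. As an illustration of the worker arithmetic described above, the following sketch submits a small job with a pool of two workers and collects its output afterward. It is not the code that Experiment Manager runs internally. The use of the default cluster profile and the rand call that stands in for an experiment are assumptions made for this example.

```matlab
c = parcluster;                    % assumption: the default profile is the target cluster
poolSize = 2;                      % value you would enter in the Pool Size field
requiredWorkers = poolSize + 1;    % pool workers plus one worker that runs the job

if c.NumWorkers < requiredWorkers
    error("Cluster offers %d workers, but the job needs %d.", ...
        c.NumWorkers, requiredWorkers);
end

% Hypothetical stand-in for the work that an experiment performs
job = batch(c, @rand, 1, {1, 10}, "Pool", poolSize);

wait(job);                         % block until the job finishes
out = fetchOutputs(job);           % cell array with one cell per output
values = out{1};                   % 1-by-10 matrix returned by rand
delete(job);                       % remove this example job from the cluster
```

The call to delete applies only to this example job. For jobs that Experiment Manager creates, use the app to cancel and delete them.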
Track Progress of Batch Job
When you run a batch job for an experiment, Experiment Manager does not continually communicate with the cluster to update the values in the results table and save the visualizations for your experiment. Instead, you retrieve this information from the cluster by clicking the Refresh button above the results table.
To monitor batch jobs without opening the Experiment Manager app, use the Job Monitor, as described in Send Deep Learning Batch Job to Cluster. The Job Monitor tells you whether your batch job is queued, running, or finished.
Using the Job Monitor to cancel or delete jobs that you create with Experiment Manager can lead to unexpected behavior. Instead, cancel and delete these batch jobs by using Experiment Manager.
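If you prefer the command line to the Job Monitor, a read-only query such as the following sketch lists the state of each job on a cluster. It assumes that the default cluster profile is the one you selected in the Cluster list. The sketch only reads job properties and does not cancel or delete anything.

```matlab
c = parcluster;            % assumption: same profile as in the Cluster list
jobs = c.Jobs;             % jobs that this cluster object lists

% Print the ID and state (for example, queued, running, or finished)
for k = 1:numel(jobs)
    fprintf("Job %d: %s\n", jobs(k).ID, jobs(k).State);
end
```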
Cancel Batch Job
To cancel a batch job running an experiment, in the Experiment Manager toolstrip, click Cancel. Experiment Manager marks any running and queued trials as Canceled and discards their results.
Batch execution does not support stopping, canceling, or restarting individual trials of an experiment.
Download Training Results
To download the training results for a completed trial, in the Actions column of the results table, click the Download button for the trial.
Experiment Manager saves the training results that you download from the cluster:
For built-in training experiments, Experiment Manager downloads the trained network and training information from the cluster.
For custom training experiments, Experiment Manager downloads the training output from the cluster.
You can access these results after you close your MATLAB session.
After you download the training results from the cluster, you can export these results to the workspace to perform additional computations to evaluate the quality of the training:
For built-in training experiments, select Export > Trained Network or Export > Training Information.
For custom training experiments, select Export > Training Output.
Delete Batch Job
To avoid consuming resources unnecessarily, delete the job from the cluster by clicking the Clean up button above the results table. Deleting the batch job permanently discards any results that you have not downloaded from the cluster.
See Also
- batch (Parallel Computing Toolbox)
- Send Deep Learning Batch Job to Cluster
- Run Batch Parallel Jobs (Parallel Computing Toolbox)
- Discover Clusters and Use Cluster Profiles (Parallel Computing Toolbox)
- Use Parallel Computing Toolbox with Cloud Center Cluster in MATLAB Online (Parallel Computing Toolbox)