Train Agents Using Parallel Computing and GPUs
If you have Parallel Computing Toolbox™ software, you can run parallel simulations on multicore processors or GPUs. If you additionally have MATLAB® Parallel Server™ software, you can run parallel simulations on computer clusters or cloud resources.
Note that parallel training and simulation are not supported for agents that use recurrent neural networks, or for agents within multi-agent environments.
Regardless of which devices you use to simulate or train the agent, once the agent has been trained you can generate code to deploy the optimal policy on a CPU or GPU. This is explained in more detail in Deploy Trained Reinforcement Learning Policies.
Using Multiple Processes
When you train agents using parallel computing, the parallel pool client (the MATLAB process that starts the training) sends copies of both its agent and environment to each parallel worker. Each worker simulates the agent within the environment and sends their simulation data back to the client. The client agent learns from the data sent by the workers and sends the updated policy parameters back to the workers.
To create a parallel pool of N workers, use the following syntax.

pool = parpool(N);

If you do not create a parallel pool using parpool (Parallel Computing Toolbox), the train function automatically creates one using your default parallel pool preferences. For more information on specifying these preferences, see Specify Your Parallel Preferences (Parallel Computing Toolbox). Note that using a parallel pool of thread workers, pool = parpool("threads"), is not supported.
For more information on configuring your training to use parallel computing, see the UseParallel and ParallelizationOptions options in rlTrainingOptions. For an example on how to configure options for asynchronous advantage actor-critic (A3C) agent training, see the last example in rlTrainingOptions.
For an example that trains an agent using parallel computing in MATLAB, see Train AC Agent to Balance Cart-Pole System Using Parallel Computing. For examples that train agents using parallel computing in Simulink®, see Train DQN Agent for Lane Keeping Assist Using Parallel Computing and Train Biped Robot to Walk Using Reinforcement Learning Agents.
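As a minimal sketch, the following shows how you might enable parallel training through the training options. Here, agent and env are assumed to be a previously created agent and environment, and the option values are illustrative.

```matlab
% Enable parallel training (train starts a pool automatically if needed).
trainOpts = rlTrainingOptions( ...
    "MaxEpisodes", 1000, ...
    "MaxStepsPerEpisode", 500, ...
    "UseParallel", true);

% Train the agent; simulations are distributed across the pool workers.
trainingStats = train(agent, env, trainOpts);
```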
Agent-Specific Parallel Training Considerations
For off-policy agents, such as DDPG and DQN agents, do not use all of your cores for parallel training. For example, if your CPU has six cores, train with four workers. Doing so provides more resources for the parallel pool client to compute gradients based on the experiences sent back from the workers. Limiting the number of workers is not necessary for on-policy agents, such as AC and PG agents, when the gradients are computed on the workers.
Gradient-Based Parallelization (AC and PG Agents)
When training AC and PG agents in parallel, both the environment simulation and the gradient computations are done by the workers. Specifically, workers simulate the agent against the environment, compute the gradients from experiences, and send the gradients to the client. The client averages the gradients, updates the network parameters, and sends the updated parameters back to the workers so that they can continue simulating the agent with the new parameters.
This type of parallel training is also known as gradient-based parallelization. In principle, it allows you to achieve a speed improvement that is nearly linear in the number of workers. However, this option requires synchronous training (that is, the Mode property of the parallelization options object that you pass to the train function must be set to "sync"). This means that each worker must pause execution until all workers are finished, and as a result the training only advances as fast as the slowest worker allows.
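The synchronous mode required for gradient-based parallelization can be sketched as follows; the property names follow the rlTrainingOptions interface, and agent and env are assumed to exist.

```matlab
% Gradient-based parallelization (AC/PG) must run synchronously.
trainOpts = rlTrainingOptions("UseParallel", true);
trainOpts.ParallelizationOptions.Mode = "sync";  % workers wait for each other

trainingStats = train(agent, env, trainOpts);
```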
Experience-Based Parallelization (DQN, DDPG, PPO, TD3, and SAC agents)
When training DQN, DDPG, PPO, TD3, and SAC agents in parallel, the environment simulation is done by the workers and the learning is done by the client. Specifically, the workers simulate the agent against the environment, and send experience data (observation, action, reward, next observation, and a termination signal) to the client. The client then computes the gradients from experiences, updates the network parameters and sends the updated parameters back to the workers, which continue to simulate the agent with the new parameters.
This type of parallel training is also known as experience-based parallelization, and it can run using asynchronous training (that is, the Mode property of the parallelization options object that you pass to the train function can be set to "async").
Experience-based parallelization can reduce training time only when the computational cost of simulating the environment is high compared to the cost of optimizing network parameters. Otherwise, when the environment simulation is fast enough, the workers lie idle waiting for the client to learn and send back the updated parameters.
To sum up, experience-based parallelization can improve sample efficiency (that is, the number of samples an agent can process within a given time) only when the ratio R between the environment step complexity and the learning complexity is large. If both environment simulation and learning are similarly computationally expensive, experience-based parallelization is unlikely to improve sample efficiency. However, in this case, for off-policy agents that are supported in parallel (DQN, DDPG, TD3, and SAC) you can reduce the mini-batch size to make R larger, thereby improving sample efficiency.
To enforce contiguity in the experience buffer when training DQN, DDPG, TD3, or SAC agents in parallel, set the NumStepsToLookAhead property of the corresponding agent options object to 1. A different value causes an error when parallel training is attempted.
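A sketch of an experience-based setup for a DQN agent might look like the following. The option names come from the DQN agent options interface; the mini-batch size is an illustrative value chosen to increase the ratio R, and critic denotes a previously created critic.

```matlab
% Off-policy (DQN) agent options for parallel, experience-based training.
agentOpts = rlDQNAgentOptions( ...
    "MiniBatchSize", 64, ...        % smaller batches increase the ratio R
    "NumStepsToLookAhead", 1);      % required for parallel training

agent = rlDQNAgent(critic, agentOpts);

% Experience-based parallelization can run asynchronously.
trainOpts = rlTrainingOptions("UseParallel", true);
trainOpts.ParallelizationOptions.Mode = "async";  % workers need not wait
```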
Using GPUs
You can speed up training by performing representation operations (such as gradient computation and prediction) on a local GPU rather than a CPU. To do so, when creating a critic or actor, set its UseDevice option to "gpu". The "gpu" option requires both Parallel Computing Toolbox software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs, see GPU Computing Requirements (Parallel Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB.
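For illustration, a critic can be directed to a GPU as sketched below; criticNet, obsInfo, and actInfo stand for a previously defined network and its observation and action specifications, and the assignment syntax assumes a release in which approximator objects expose UseDevice as a settable property.

```matlab
% Select the first local GPU for use with MATLAB.
gpuDevice(1);

% Create a critic and run its gradient/prediction operations on the GPU.
critic = rlQValueFunction(criticNet, obsInfo, actInfo);
critic.UseDevice = "gpu";
```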
Using GPUs is likely to be beneficial when the actor or critic contains a deep neural network that uses large batch sizes or that performs operations such as multiple convolutional layers on input images.
For an example on how to train an agent using the GPU, see Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation.
Using both Multiple Processes and GPUs
You can also train agents using both multiple processes and a local GPU (previously selected using gpuDevice (Parallel Computing Toolbox)) at the same time. To do so, first create a critic or actor approximator object in which the UseDevice option is set to "gpu". You can then use the critic and actor to create an agent, and train the agent using multiple processes by creating a training options object in which UseParallel is set to true and passing it to the train function.
For gradient-based parallelization (which must run in synchronous mode), the environment simulation is done by the workers, which use their local GPUs to calculate the gradients and perform prediction steps. The gradients are then sent back to the parallel pool client process, which averages them, updates the network parameters, and sends them back to the workers so they can continue to simulate the agent, with the new parameters, against the environment.
For experience-based parallelization (which can run in asynchronous mode), the workers simulate the agent against the environment and send experience data back to the parallel pool client. The client then uses its local GPU to compute the gradients from the experiences, updates the network parameters, and sends the updated parameters back to the workers, which continue to simulate the agent, with the new parameters, against the environment.
Note that when using both parallel processing and a GPU to train PPO agents, the workers use their local GPUs to compute the advantages, and then send processed experience trajectories (which include advantages, targets, and action probabilities) back to the client.
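Putting the two mechanisms together, a combined setup might be sketched as follows; actor, critic, agent, and env are assumed to have been created earlier, and the UseDevice assignment assumes a release in which it is a settable approximator property.

```matlab
% Run actor and critic computations on the local GPU.
actor.UseDevice  = "gpu";
critic.UseDevice = "gpu";

% Distribute simulations across multiple worker processes.
trainOpts = rlTrainingOptions("UseParallel", true);
trainOpts.ParallelizationOptions.Mode = "async";  % for experience-based agents

trainingStats = train(agent, env, trainOpts);
```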