The deep deterministic policy gradient (DDPG) algorithm is a model-free, online, off-policy reinforcement learning method. A DDPG agent is an actor-critic reinforcement learning agent that computes an optimal policy that maximizes the long-term reward.

For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

DDPG agents can be trained in environments with the following observation and action spaces.

| Observation Space | Action Space |
| --- | --- |
| Continuous or discrete | Continuous |
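As the table indicates, DDPG accepts either observation type but requires a continuous action space, because the actor outputs a specific action rather than a probability over discrete actions. A minimal compatibility check illustrating this rule (the string labels here are hypothetical, not a toolbox API):

```python
def ddpg_supports(observation_space: str, action_space: str) -> bool:
    """Return True if DDPG can be applied, per the table above.

    Spaces are described by the hypothetical labels "continuous" or
    "discrete"; DDPG accepts either observation type but only
    continuous actions.
    """
    return (observation_space in ("continuous", "discrete")
            and action_space == "continuous")

print(ddpg_supports("discrete", "continuous"))   # discrete observations are fine
print(ddpg_supports("continuous", "discrete"))   # discrete actions are not supported
```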

During training, a DDPG agent:

- Updates the actor and critic properties at each time step during learning.
- Stores past experience using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.
- Perturbs the action chosen by the policy using a stochastic noise model at each training step.
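The circular buffer and mini-batch sampling described above can be sketched as follows. This is an illustrative Python version, not the toolbox's implementation; the capacity and batch size are arbitrary:

```python
import random
from collections import deque

class ReplayBuffer:
    """Circular experience buffer: once capacity is reached,
    the oldest experience is overwritten by the newest."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # deque drops the oldest item automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch of past experiences
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for t in range(5):                  # store 5 experiences in a 3-slot buffer
    buf.store(t, t, 1.0, t + 1, False)
print(len(buf.buffer))              # the two oldest experiences were overwritten
batch = buf.sample(2)               # random mini-batch for an update
```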

To estimate the policy and value function, a DDPG agent maintains four function approximators:

- Actor *μ*(*S*) — The actor takes observation *S* and outputs the corresponding action that maximizes the long-term reward.
- Target actor *μ'*(*S*) — To improve the stability of the optimization, the agent periodically updates the target actor based on the latest actor parameter values.
- Critic *Q*(*S*,*A*) — The critic takes observation *S* and action *A* as inputs and outputs the corresponding expectation of the long-term reward.
- Target critic *Q'*(*S*,*A*) — To improve the stability of the optimization, the agent periodically updates the target critic based on the latest critic parameter values.

Both *Q*(*S*,*A*) and
*Q'*(*S*,*A*) have the same structure
and parameterization, and both *μ*(*S*) and
*μ'*(*S*) have the same structure and
parameterization.
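Because each target network shares the structure and parameterization of its learned counterpart, it can be represented as a copy of that network's parameters that lags behind the learned values. A sketch under that assumption (the parameter containers are hypothetical):

```python
import copy

# Hypothetical parameter containers for the learned networks
critic_params = {"W": [[0.1, -0.2], [0.3, 0.4]], "b": [0.0, 0.0]}
actor_params = {"W": [[0.5], [-0.5]], "b": [0.1]}

# Targets start as exact copies: theta_Q' = theta_Q, theta_mu' = theta_mu
target_critic_params = copy.deepcopy(critic_params)
target_actor_params = copy.deepcopy(actor_params)

# A learning step changes the learned critic, but the target is
# untouched until the agent explicitly propagates the new values
critic_params["b"][0] = 0.7
print(target_critic_params["b"][0])   # target still holds the old value
```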

When training is complete, the trained optimal policy is stored in actor
*μ*(*S*).

For more information on creating actors and critics for function approximation, see Create Policy and Value Function Representations.

To create a DDPG agent:

1. Create an actor representation object.
2. Create a critic representation object.
3. Specify agent options using the `rlDDPGAgentOptions` function.
4. Create the agent using the `rlDDPGAgent` function.

For more information, see `rlDDPGAgent` and `rlDDPGAgentOptions`.

DDPG agents use the following training algorithm, in which they update their actor and critic models at each time step. To configure the training algorithm, specify options using `rlDDPGAgentOptions`.

Initialize the critic *Q*(*S*,*A*) with random parameter values $\theta_Q$, and initialize the target critic with the same random parameter values: $\theta_{Q'} = \theta_Q$.

Initialize the actor *μ*(*S*) with random parameter values $\theta_\mu$, and initialize the target actor with the same parameter values: $\theta_{\mu'} = \theta_\mu$.

For each training time step:

1. For the current observation *S*, select action *A* = *μ*(*S*) + *N*, where *N* is stochastic noise from the noise model. To configure the noise model, use the `NoiseOptions` option.

2. Execute action *A*. Observe the reward *R* and next observation *S'*.

3. Store the experience (*S*,*A*,*R*,*S'*) in the experience buffer.

4. Sample a random mini-batch of *M* experiences ($S_i$, $A_i$, $R_i$, $S_i'$) from the experience buffer. To specify *M*, use the `MiniBatchSize` option.

5. If $S_i'$ is a terminal state, set the value function target $y_i$ to $R_i$. Otherwise, set it to

   $$y_i = R_i + \gamma\, Q'\left(S_i',\, \mu'\left(S_i' \mid \theta_{\mu'}\right) \mid \theta_{Q'}\right)$$

   The value function target is the sum of the experience reward $R_i$ and the discounted future reward. To specify the discount factor *γ*, use the `DiscountFactor` option.

   To compute the cumulative reward, the agent first computes a next action by passing the next observation $S_i'$ from the sampled experience to the target actor. The agent finds the cumulative reward by passing the next action to the target critic.

6. Update the critic parameters by minimizing the loss *L* across all sampled experiences.

   $$L = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - Q\left(S_i, A_i \mid \theta_Q\right)\right)^2$$

7. Update the actor parameters using the following sampled policy gradient to maximize the expected discounted reward.

   $$\nabla_{\theta_\mu} J \approx \frac{1}{M}\sum_{i=1}^{M} G_{ai}\, G_{\mu i}$$

   $$G_{ai} = \nabla_A\, Q\left(S_i, A \mid \theta_Q\right) \quad \text{where } A = \mu\left(S_i \mid \theta_\mu\right)$$

   $$G_{\mu i} = \nabla_{\theta_\mu}\, \mu\left(S_i \mid \theta_\mu\right)$$

   Here, $G_{ai}$ is the gradient of the critic output with respect to the action computed by the actor network, and $G_{\mu i}$ is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation $S_i$.

8. Update the target actor and critic depending on the target update method (smoothing or periodic). To select the update method, use the `TargetUpdateMethod` option.

   $$\theta_{Q'} = \tau\,\theta_Q + (1-\tau)\,\theta_{Q'}, \qquad \theta_{\mu'} = \tau\,\theta_\mu + (1-\tau)\,\theta_{\mu'} \qquad \text{(smoothing)}$$

   $$\theta_{Q'} = \theta_Q, \qquad \theta_{\mu'} = \theta_\mu \qquad \text{(periodic)}$$

   By default, the agent uses target smoothing and updates the target actor and critic at every time step using smoothing factor *τ*. To specify the smoothing factor, use the `TargetSmoothFactor` option. Alternatively, you can update the target actor and critic periodically. To specify the number of episodes between target updates, use the `TargetUpdateFrequency` option.
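One pass through steps 5 through 8 can be sketched in NumPy under the simplifying assumption of linear approximators and plain gradient steps (the toolbox uses neural networks and configurable optimizers, so this is an illustration of the update rules, not the toolbox's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, tau, lr = 0.99, 0.01, 1e-3    # discount factor, smoothing factor, step size

# Hypothetical linear approximators standing in for the networks:
#   actor:  mu(s)   = theta_mu . s                 (scalar action)
#   critic: Q(s, a) = theta_q . concat(s, [a])
theta_mu = rng.normal(size=2)
theta_q = rng.normal(size=3)
theta_mu_t = theta_mu.copy()         # target actor:  theta_mu' = theta_mu
theta_q_t = theta_q.copy()           # target critic: theta_q'  = theta_q

def mu(s, th):
    return th @ s

def q(s, a, th):
    return th @ np.concatenate([s, [a]])

# A sampled mini-batch of M experiences (S_i, A_i, R_i, S_i')
M = 4
S = rng.normal(size=(M, 2))
A = np.array([mu(s, theta_mu) + rng.normal(scale=0.1) for s in S])  # action plus noise N
R = rng.normal(size=M)
S2 = rng.normal(size=(M, 2))
terminal = np.array([False, False, False, True])

# Step 5 -- value function targets: y_i = R_i at terminal states,
# otherwise y_i = R_i + gamma * Q'(S_i', mu'(S_i'))
y = np.array([r if d else r + gamma * q(s2, mu(s2, theta_mu_t), theta_q_t)
              for r, s2, d in zip(R, S2, terminal)])

# Step 6 -- critic: one gradient-descent step on L = (1/M) sum (y_i - Q(S_i, A_i))^2
phi = np.column_stack([S, A])                      # critic features concat(s, a)
td_error = y - phi @ theta_q
theta_q = theta_q + lr * (2.0 / M) * phi.T @ td_error

# Step 7 -- actor: ascend the sampled policy gradient
#   G_ai  = dQ/dA         = theta_q[-1]  for this linear critic
#   G_mui = dmu/dtheta_mu = S_i          for this linear actor
grad_J = (theta_q[-1] * S).mean(axis=0)
theta_mu = theta_mu + lr * grad_J

# Step 8 -- target smoothing: theta' = tau * theta + (1 - tau) * theta'
theta_q_t = tau * theta_q + (1 - tau) * theta_q_t
theta_mu_t = tau * theta_mu + (1 - tau) * theta_mu_t
```

Note that the critic descends its loss while the actor ascends the policy gradient, and that the targets move only a fraction *τ* of the way toward the learned parameters each step.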

For simplicity, the actor and critic updates in this algorithm show a gradient update using basic stochastic gradient descent. The actual gradient update method depends on the optimizer specified using `rlRepresentationOptions`.
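The basic stochastic gradient descent step referred to above is just $\theta \leftarrow \theta - \alpha \nabla_\theta L$. A minimal sketch on a toy quadratic loss (the learning rate is arbitrary):

```python
import numpy as np

alpha = 0.1                 # learning rate
theta = np.array([2.0])     # a single parameter

def loss(th):
    # Simple quadratic loss with its minimum at theta = 0
    return float(th[0] ** 2)

def grad(th):
    # Analytic gradient of the loss above
    return np.array([2.0 * th[0]])

before = loss(theta)
theta = theta - alpha * grad(theta)   # one basic SGD step
after = loss(theta)
print(before, after)                  # the step reduces the loss
```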

[1] T. P. Lillicrap, J. J. Hunt, A.
Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. “Continuous control with
deep reinforcement learning,” *International Conference on Learning
Representations*, 2016.

`rlDDPGAgent` | `rlDDPGAgentOptions` | `rlRepresentation`