Deep Q-Network (DQN) Agent

The deep Q-network (DQN) algorithm is an off-policy reinforcement learning method for environments with discrete action spaces. A DQN agent trains a Q-value function to estimate the expected discounted cumulative long-term reward when following the optimal policy. DQN is a variant of Q-learning that features a target critic and an experience buffer. The DQN agent supports offline training (training from saved data, without an environment). For more information on Q-learning, see Q-Learning Agent. For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.

In Reinforcement Learning Toolbox™, a DQN agent is implemented by an rlDQNAgent object.

DQN agents can be trained in environments with the following observation and action spaces.

Observation Space	Action Space
Discrete, continuous, or hybrid.	Discrete

DQN agents use the following critic.

Critic	Actor
Q-value function critic Q(S,A), which you create using `rlQValueFunction` or `rlVectorQValueFunction`	DQN agents do not use an actor.

During training, the agent:

Updates the critic learnable parameters at each time step during learning.
Explores the action space using epsilon-greedy exploration. During each control interval, the agent either selects a random action with probability ϵ or selects an action greedily with respect to the action-value function with probability 1-ϵ. The greedy action is the action for which the action-value function is greatest.
Stores past experiences using a circular experience buffer. The agent updates the critic based on a mini-batch of experiences randomly sampled from the buffer.
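The exploration and storage behavior listed above can be illustrated with a minimal sketch. It is written in plain Python rather than with Reinforcement Learning Toolbox objects, and the names `epsilon_greedy`, `buffer`, and the toy Q-values are hypothetical, introduced only for illustration.

```python
import random
from collections import deque

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value (the first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])

# A deque with maxlen acts as a circular buffer: once it is full,
# appending a new experience discards the oldest one.
buffer = deque(maxlen=3)
for step in range(5):
    buffer.append((f"S{step}", 0, 1.0, f"S{step + 1}"))  # (S, A, R, S')

# With epsilon = 0 the choice is always greedy.
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)

# The critic update uses a mini-batch sampled at random from the buffer.
mini_batch = random.sample(list(buffer), k=2)
```

In this sketch the buffer ends up holding only the three most recent experiences, which is the behavior a circular buffer is expected to have.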

Critics Used by the DQN Agent

To estimate the value of the optimal policy, a DQN agent uses two parametrized action-value functions, each maintained by a corresponding critic.

Critic Q(S,A;ϕ) — Given observation S and action A this critic stores the corresponding estimate of the expected discounted cumulative long-term reward when following the optimal policy (this is the value of the optimal policy).
Target critic Q_t(S,A;ϕ_t) — To improve the stability of the optimization, the agent periodically updates the target critic learnable parameters ϕ_t using the latest critic parameter values.

Both Q(S,A;ϕ) and Q_t(S,A;ϕ_t) are implemented by function approximator objects having the same structure and parameterization. During training, the training algorithm tunes the critic parameter values to improve their action-value function estimation. After training, the parameters remain at their tuned values in the critics internal to the trained agent.

For more information on critics, see Create Actors, Critics, and Policy Objects.

DQN Agent Creation

You can create and train DQN agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.

At the command line, you can create a default DQN agent based on the observation and action specifications from the environment. A default DQN agent uses default function approximators that rely on a deep neural network model. To create a default agent, perform the following steps.

Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using getObservationInfo.
Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using getActionInfo.
If needed, specify the number of neurons in each learnable layer (the default is 256 neurons) or whether to use an LSTM layer (by default no LSTM layer is used). To do so, create an agent initialization option object using rlAgentInitializationOptions.
If needed, specify agent options using an rlDQNAgentOptions object. Alternatively, you can skip this step and modify the agent options later using dot notation.
Create the agent using rlDQNAgent.

Alternatively, you can create a critic and use it to create your agent. In this case, ensure that the dimensions of the observation and action layers in the critic match the corresponding observation and action specifications of the environment.

Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using getObservationInfo.
Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using getActionInfo.
Create an approximation model for your critic. Depending on the type of problem and on the specific critic you use in the next step, this model can be an rlTable object (only for discrete observation spaces), a custom basis function with initial parameter values, or a neural network object. The inputs and outputs of the model you create depend on the type of critic you use in the next step.
Create a critic using rlQValueFunction or rlVectorQValueFunction. Use the model you created in the previous step as a first input argument.
Specify agent options using an rlDQNAgentOptions object. Alternatively, you can skip this step and modify the agent options later using dot notation.
Create the agent using rlDQNAgent.

DQN agents support critics that use recurrent deep neural networks as function approximators.

For more information on creating actors and critics for function approximation, see Create Actors, Critics, and Policy Objects.

DQN Agent Initialization

When you create a DQN agent, the critic Q(S,A;ϕ) uses random parameter values in ϕ. The agent then initializes the target critic parameters ϕ_t with the same values.

The agent uses these initial critic parameters at the beginning of the first training session. For each subsequent training session, the critics retain the parameters from the previous session.

DQN Training Algorithm

DQN agents use the following training algorithm, in which they update their critic model at each time step. To configure the training algorithm, specify options using an rlDQNAgentOptions object.

Perform a warm start by taking a sequence of actions following an epsilon-greedy policy:
1. At the beginning of each episode, get the initial observation from the environment.
2. For the current observation S, select a random action A with probability ϵ. Otherwise, select the action for which the critic value function is greatest.
  $A = \underset{A}{\arg \max} Q (S, A; ϕ)$
  To specify ϵ and its decay rate, use the EpsilonGreedyExploration option.
3. Execute action A. Observe the reward R and the next observation S'.
4. Store the experience (S,A,R,S') in the experience buffer.
5. If ϵ is greater than its minimum value, perform the decay operation as described in EpsilonGreedyExploration.
To specify the size of the experience buffer, use the ExperienceBufferLength option in the agent rlDQNAgentOptions object. To specify the number of warm-start actions, use the NumWarmStartSteps option.
After the warm start procedure, for each training time step:
1. Execute the five operations described in the warm start procedure.
2. Every D_C time steps (to specify D_C, use the LearningFrequency option), perform the following two operations NumEpoch times:
  1. Using all the collected experiences, create a maximum of B different mini-batches. To specify B, use the MaxMiniBatchPerEpoch option. Each mini-batch contains M different (typically nonconsecutive) experiences (S_i,A_i,R_i,S'_i) that are randomly sampled from the experience buffer (each experience can only be part of one mini-batch). To specify M, use the MiniBatchSize option.
    If the agent contains recurrent neural networks, each mini-batch contains M different sequences. Each sequence contains K consecutive experiences (starting from a randomly sampled experience). To specify K, use the SequenceLength option.
  2. For each (randomly selected) mini-batch, perform the learning operations described in Mini-Batch Learning Operations.
  When LearningFrequency has the default value of -1, the mini-batch creation (described in step 1) and the learning operations (described in step 2) are executed after each episode is finished.
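The scheduling described above can be sketched as follows. This is an illustrative plain-Python outline, not the toolbox implementation. The function names are hypothetical, the buffer is an unbounded list for brevity, and two details are assumptions not stated in the text: leftover experiences that do not fill a whole mini-batch are dropped, and learning steps are counted in absolute time steps. The sketch covers only a positive learning frequency, not the default value of -1.

```python
import random

def make_mini_batches(experiences, batch_size, max_batches, rng):
    """Split shuffled experiences into at most max_batches disjoint mini-batches."""
    indices = list(range(len(experiences)))
    rng.shuffle(indices)
    # Assumption: an incomplete trailing mini-batch is discarded.
    n_batches = min(max_batches, len(indices) // batch_size)
    return [
        [experiences[i] for i in indices[k * batch_size:(k + 1) * batch_size]]
        for k in range(n_batches)
    ]

def count_critic_updates(total_steps, warm_start_steps, learning_frequency,
                         num_epoch, batch_size, max_batches, seed=0):
    """Walk through the schedule and count how many mini-batch updates occur."""
    rng = random.Random(seed)
    buffer = []
    updates = 0
    for t in range(1, total_steps + 1):
        buffer.append(("S", 0, 0.0, "S'"))  # placeholder experience (S, A, R, S')
        if t <= warm_start_steps:
            continue  # warm start: collect experiences only
        if t % learning_frequency == 0:
            for _ in range(num_epoch):
                for _batch in make_mini_batches(buffer, batch_size, max_batches, rng):
                    updates += 1  # one mini-batch learning operation
    return updates

updates = count_critic_updates(total_steps=20, warm_start_steps=8,
                               learning_frequency=4, num_epoch=2,
                               batch_size=4, max_batches=3)
```

With these values, learning happens at time steps 12, 16, and 20, and each occasion runs 2 epochs of 3 mini-batches.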

Mini-Batch Learning Operations

The agent performs the following operations for each mini-batch.

Update the probability threshold ϵ for selecting a random action based on the decay rate you specify in the EpsilonGreedyExploration option.
Update the critic parameters by one-step minimization of the loss L_k across all sampled experiences.
$L_{k} = \frac{1}{2 M} \sum_{i = 1}^{M} {(y_{i} - Q_{k} (S_{i}, A_{i}; ϕ_{k}))}^{2}$
To specify the optimizer options used to minimize L_k, use the options contained in the CriticOptimizerOptions option (which in turn contains an rlOptimizerOptions object).
If the agent contains recurrent neural networks, each element of the sum over the batch elements is itself a sum over the time (sequence) dimension.
If S'_i is a terminal state, set the value function target y_i to R_i. Otherwise, set it to
$\begin{array}{ll} \begin{array}{l} A_{\max} = \underset{A'}{\arg \max} Q (S_{i}', A'; ϕ) \\ y_{i} = R_{i} + γ Q_{t} (S_{i}', A_{\max}; ϕ_{t}) \end{array} & \text{(double DQN)} \\ y_{i} = R_{i} + γ \max_{A'} Q_{t} (S_{i}', A'; ϕ_{t}) & \text{(DQN)} \end{array}$
Here, the normal DQN algorithm selects the action that maximizes the action-value function maintained by the target critic, while the double DQN selects the action that maximizes the action-value function maintained by the base critic.
To set the discount factor γ, use the DiscountFactor option. To use double DQN, set the UseDoubleDQN option to true.
If you specify a value of NumStepsToLookAhead equal to N, then the N-step return (which adds the rewards of the following N steps and the discounted estimated value of the state that caused the N-th reward) is used to calculate the target y_i.
Every TargetUpdateFrequency critic updates, update the target critic parameters depending on the target update method. For more information, see Target Update Methods.
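As a worked illustration of the one-step targets and the loss above, the following plain-Python sketch uses small lookup tables in place of the critic and target critic. The function names, the table-backed critics, and the added terminal flag in each experience are hypothetical choices made for this example. N-step returns and recurrent networks are not covered.

```python
def q_targets(batch, q, q_target, gamma, use_double_dqn):
    """Compute y_i for each (S, A, R, S_next, is_terminal) experience.

    q and q_target map a state to a list of Q-values, one per action.
    """
    targets = []
    for _, _, reward, next_state, is_terminal in batch:
        if is_terminal:
            targets.append(reward)  # y_i = R_i for terminal next states
            continue
        target_values = q_target(next_state)
        if use_double_dqn:
            # Double DQN: the base critic selects the action,
            # the target critic evaluates it.
            online_values = q(next_state)
            a_max = max(range(len(online_values)), key=lambda a: online_values[a])
            targets.append(reward + gamma * target_values[a_max])
        else:
            # DQN: the target critic both selects and evaluates.
            targets.append(reward + gamma * max(target_values))
    return targets

def critic_loss(batch, targets, q):
    """L = 1/(2M) * sum_i (y_i - Q(S_i, A_i))^2."""
    m = len(batch)
    return sum((y - q(s)[a]) ** 2 for (s, a, *_), y in zip(batch, targets)) / (2 * m)

# Toy critics backed by tables (two states, two actions).
q_table = {"s0": [0.5, 1.0], "s1": [1.0, 3.0]}
qt_table = {"s0": [0.0, 0.0], "s1": [2.0, 1.0]}
batch = [("s0", 1, 1.0, "s1", False), ("s0", 0, 2.0, "s1", True)]

y_dqn = q_targets(batch, q_table.get, qt_table.get, gamma=0.5, use_double_dqn=False)
y_double = q_targets(batch, q_table.get, qt_table.get, gamma=0.5, use_double_dqn=True)
loss_dqn = critic_loss(batch, y_dqn, q_table.get)
```

In this example the two variants disagree on the first target because the base critic prefers the second action in `s1` while the target critic assigns the larger value to the first.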

Target Update Methods

DQN agents update their target critic parameters using one of the following target update methods.

Smoothing — Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option.
$ϕ_{t} = τ ϕ + (1 - τ) ϕ_{t}$
Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.
Periodic Smoothing — Update the target parameters periodically with smoothing.

To configure the target update method, create an rlDQNAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.

Update Method	`TargetUpdateFrequency`	`TargetSmoothFactor`
Smoothing (default)	`1`	Less than `1`
Periodic	Greater than `1`	`1`
Periodic smoothing	Greater than `1`	Less than `1`
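The three update methods in the table reduce to a single rule parameterized by the update frequency and the smoothing factor. The sketch below shows this in plain Python. The function name `update_target` and the representation of parameters as flat lists are assumptions made for illustration, and the counter is taken to be the number of critic updates performed so far.

```python
def update_target(phi, phi_t, tau, update_count, target_update_frequency):
    """Return the target parameters after the given number of critic updates."""
    if update_count % target_update_frequency != 0:
        return list(phi_t)  # not an update step: target stays unchanged
    # phi_t = tau * phi + (1 - tau) * phi_t; tau = 1 copies phi exactly.
    return [tau * p + (1.0 - tau) * pt for p, pt in zip(phi, phi_t)]

phi = [1.0, 2.0]    # critic parameters
phi_t = [0.0, 0.0]  # target critic parameters

smoothing = update_target(phi, phi_t, tau=0.25, update_count=1, target_update_frequency=1)
periodic_skip = update_target(phi, phi_t, tau=1.0, update_count=3, target_update_frequency=4)
periodic = update_target(phi, phi_t, tau=1.0, update_count=4, target_update_frequency=4)
periodic_smoothing = update_target(phi, phi_t, tau=0.25, update_count=4, target_update_frequency=4)
```

Smoothing moves the target a fraction τ toward the critic on every update, periodic updating copies the critic outright every few updates, and periodic smoothing combines the two.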
