rlConservativeQLearningOptions

Regularizer options object to train DQN and SAC agents

Since R2023a

Description

Use an rlConservativeQLearningOptions object to specify conservative Q-learning regularizer options for training a DQN or SAC agent. You can specify the minimum weight and the number of random actions used for Q-value compensation. These options are mostly useful when training agents offline, specifically to deal with possible differences between the probability distribution of the data set and the one generated by the environment. To enable the conservative Q-learning regularizer when training an agent, set the BatchDataRegularizerOptions property of the agent options object to an rlConservativeQLearningOptions object that has your preferred minimum weight and number of sampled actions.

Creation

Syntax

cqOpts = rlConservativeQLearningOptions

cqOpts = rlConservativeQLearningOptions(PropertyName=Value)

Description

cqOpts = rlConservativeQLearningOptions returns a default conservative Q-learning regularizer options object.

cqOpts = rlConservativeQLearningOptions(PropertyName=Value) creates the conservative Q-learning regularizer option set cqOpts and sets its properties using one or more name-value arguments.

Properties

`MinQValueWeight` — Weight used for Q-value compensation
`1` (default) | positive scalar

Weight used for Q-value compensation, specified as a positive scalar. For more information, see Algorithms.

Example: MinQValueWeight=0.1

`NumSampledActions` — Number of sampled actions used for Q-value compensation
`10` (default) | positive integer

Number of sampled actions used for Q-value compensation, specified as a positive integer. This is the number of random actions used to estimate the logarithm of the sum of Q-values for the SAC agent. For more information, see Continuous Actions Regularizer (SAC).

Example: NumSampledActions=30

Object Functions

Examples

Create Conservative Q-Learning Options Object

Create an rlConservativeQLearningOptions object specifying the weight to be used for Q-value compensation.

opt = rlConservativeQLearningOptions( ...
    MinQValueWeight=5)

opt = 
  rlConservativeQLearningOptions with properties:

      MinQValueWeight: 5
    NumSampledActions: 10

You can modify the options using dot notation. For example, set NumSampledActions to 20.

opt.NumSampledActions = 20;

To specify this conservative Q-learning option set for an agent, first create the agent options object. For this example, create a default rlDQNAgentOptions object for a DQN agent.

agentOpts = rlDQNAgentOptions;

Then, assign the rlConservativeQLearningOptions object to the BatchDataRegularizerOptions property.

agentOpts.BatchDataRegularizerOptions = opt;

When you create the agent, use agentOpts as the last input argument for the agent constructor function rlDQNAgent.
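The same assignment pattern should apply to a SAC agent, where NumSampledActions also comes into play. The following is a minimal sketch, assuming that rlSACAgentOptions exposes the same BatchDataRegularizerOptions property as the DQN options object; the numeric values are arbitrary illustrations, not recommendations.

```matlab
% Regularizer options; both values are arbitrary and for illustration only.
cqOpts = rlConservativeQLearningOptions( ...
    MinQValueWeight=2, ...
    NumSampledActions=20);

% Attach the regularizer options to a default SAC agent options object.
% Assumption: rlSACAgentOptions has a BatchDataRegularizerOptions property.
sacOpts = rlSACAgentOptions;
sacOpts.BatchDataRegularizerOptions = cqOpts;
```

As with the DQN case, you would then pass sacOpts as the last input argument when creating the agent.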

Algorithms

In conservative Q-learning, the regularizer added to the critic loss relies on the difference between the expected Q-values of the actions from the current policy and the Q-values of the actions from the data set.

Discrete Actions Regularizer (DQN)

For an agent with a discrete action space, the resulting loss function that the agent minimizes is the following:

$L = \frac{1}{M} \sum_{i = 1}^{M} \left( W_{cq} \left( \log \left( \sum_{a \in A} \exp (Q (s_{i}, a)) \right) - Q (s_{i}, a_{i}) \right) + \frac{1}{2} {(Q (s_{i}, a_{i}) - y_{i})}^{2} \right)$

Here, A is the set of all possible actions, M is the number of experiences in the mini-batch, s_i is an observation stored in the mini-batch, a_i is the action stored with s_i in the mini-batch, and y_i is the target Q-value corresponding to Q(s_i,a_i).

To set W_cq, assign a value to the MinQValueWeight property of the rlConservativeQLearningOptions object. For more information, see [1].
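To make the terms of this loss concrete, the following sketch evaluates it for a toy mini-batch using only base MATLAB functions. It is an illustration under stated assumptions, not the toolbox implementation: the Q-values, action indices, and targets are made-up numbers, and the weight is applied to the logarithmic penalty only, following the form in [1] and in the continuous case.

```matlab
% Toy mini-batch: M = 3 experiences, 4 discrete actions (made-up values).
% Row i of Q holds hypothetical critic outputs Q(s_i,a) for every action a.
Q = [1.0 2.0 0.5 0.0;
     0.2 0.1 0.4 0.3;
     3.0 1.0 2.0 2.5];
aIdx = [2; 3; 1];          % index of the data-set action a_i in each row
y    = [1.5; 0.6; 2.8];    % hypothetical target Q-values y_i
Wcq  = 1;                  % plays the role of MinQValueWeight

M = size(Q,1);

% Q(s_i,a_i): pick one entry per row using linear indexing.
Qdata = Q(sub2ind(size(Q),(1:M)',aIdx));

% Numerically stable log-sum-exp over the action dimension.
Qmax = max(Q,[],2);
lse  = Qmax + log(sum(exp(Q - Qmax),2));

regTerm = lse - Qdata;            % conservative penalty per experience
tdTerm  = 0.5*(Qdata - y).^2;     % squared error against the target
L = mean(Wcq*regTerm + tdTerm);   % average over the mini-batch
```

Because the log-sum-exp is never smaller than the largest Q-value in a row, regTerm is nonnegative. Increasing Wcq therefore penalizes high Q-values on actions other than the one stored in the data set more strongly.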

Continuous Actions Regularizer (SAC)

Similarly to the discrete action case, for an agent with a continuous action space, the resulting loss function that the agent minimizes is the following:

$L = \frac{1}{M} \sum_{i = 1}^{M} \left( W_{cq} \left( \log \left( \sum_{A} \exp (Q (s_{i}, a)) \right) - Q (s_{i}, a_{i}) \right) + \frac{1}{2} {(Q (s_{i}, a_{i}) - y_{i})}^{2} \right)$

Here, A is the (continuous) action set, M is the number of experiences in the mini-batch, s_i is an observation stored in the mini-batch, and y_i is the target Q-value corresponding to Q(s_i,a_i). The first (logarithmic) term in the sum is properly defined for a continuous action space in [1] and is approximated as follows:

$\log \left( \sum_{A} \exp (Q (s_{i}, a)) \right) \approx \log \left( \frac{1}{2 N} \sum_{a_{k} \sim \mathrm{Unif}(A_{mn}, A_{mx})}^{N} \frac{\exp (Q (s_{i}, a_{k}))}{f_{\mathrm{Unif}(A_{mn}, A_{mx})} (a_{k})} + \frac{1}{2 N} \sum_{a_{h} \sim \pi (\cdot | s_{i})}^{N} \frac{\exp (Q (s_{i}, a_{h}))}{\pi (a_{h} | s_{i})} \right)$

In this second equation, Unif(A_mn,A_mx) is a uniform distribution of action values from A_mn to A_mx, which are the lower and upper limits of the action range. These limits are taken from the action specifications (or are otherwise estimated if unavailable). The probability density function of the distribution, evaluated at a_k, is:

$f_{\mathrm{Unif}(A_{mn}, A_{mx})} (a_{k}) = \frac{1}{A_{mx} - A_{mn}}$

Finally, π(∙|s_i) is the distribution of the current policy given s_i.

To set W_cq in the first equation, assign a value to the MinQValueWeight property of the rlConservativeQLearningOptions object.

To set N (the number of actions to be sampled to estimate the logarithm term in the second equation), assign a value to the NumSampledActions property of the rlConservativeQLearningOptions object.

For more information, see [1].
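The sampling-based approximation of the logarithmic term can be sketched in base MATLAB for a scalar action. This is a minimal illustration, not the toolbox implementation: the critic Qfun, the Gaussian policy (mu, sigma, piPdf), the observation, and the action limits are all hypothetical choices. The policy here is an unbounded Gaussian for simplicity, so some of its samples can fall outside the action range.

```matlab
rng(0);                        % reproducible random samples
N   = 10;                      % plays the role of NumSampledActions
Amn = -2;  Amx = 2;            % assumed lower and upper action limits
s   = 0.3;                     % a single scalar observation s_i

% Hypothetical critic and policy, chosen only for illustration.
Qfun  = @(s,a) -(a - s).^2;    % toy Q(s,a)
mu    = s;  sigma = 0.5;       % toy policy pi(.|s) = Normal(mu,sigma^2)
piPdf = @(a) exp(-0.5*((a - mu)/sigma).^2) / (sigma*sqrt(2*pi));

% N actions drawn uniformly from the action range, and the uniform density.
aUnif = Amn + (Amx - Amn)*rand(N,1);
fUnif = 1/(Amx - Amn);

% N actions drawn from the toy policy.
aPi = mu + sigma*randn(N,1);

% Importance-weighted estimate of the logarithmic term: each sample of
% exp(Q) is divided by the density of the distribution it was drawn from.
est = log( sum(exp(Qfun(s,aUnif))/fUnif)/(2*N) ...
         + sum(exp(Qfun(s,aPi))./piPdf(aPi))/(2*N) );

% Rough reference: numerical integral of exp(Q) over the action range.
ref = log(integral(@(a) exp(Qfun(s,a)),Amn,Amx));
```

With only N = 10 samples per distribution, est is a noisy estimate of ref. Increasing N reduces the variance of the estimate at the cost of more critic evaluations per experience.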

References

[1] Kumar, Aviral, Aurick Zhou, George Tucker, and Sergey Levine. "Conservative Q-Learning for Offline Reinforcement Learning." Advances in Neural Information Processing Systems 33 (2020): 1179-1191.

Version History

Introduced in R2023a
