# rlPPOAgentOptions

Create options for PPO agent

## Description

opt = rlPPOAgentOptions creates an rlPPOAgentOptions object with all default settings, for use as an argument when creating a PPO agent. You can modify the object properties using dot notation.

opt = rlPPOAgentOptions(Name,Value) creates a PPO agent options object using the specified name-value pairs to override default property values.
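As a minimal sketch of the name-value syntax, the following call overrides several defaults at once. The specific values are illustrative assumptions, not recommendations; they are chosen only so that the mini-batch size does not exceed the experience horizon.

```matlab
% Sketch: set several options in one call (values are illustrative only).
opt = rlPPOAgentOptions( ...
    'ExperienceHorizon',256, ...
    'MiniBatchSize',64, ...
    'DiscountFactor',0.99);

% Read the properties back to confirm that the overrides took effect.
disp(opt.ExperienceHorizon)
disp(opt.MiniBatchSize)
```

Any property not named in the call keeps its default value.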

## Examples

Create a PPO agent options object, specifying the experience horizon.

opt = rlPPOAgentOptions('ExperienceHorizon',256)
opt =
rlPPOAgentOptions with properties:

ExperienceHorizon: 256
MiniBatchSize: 128
ClipFactor: 0.2000
EntropyLossWeight: 0.0100
NumEpoch: 3
GAEFactor: 0.9500
SampleTime: 1
DiscountFactor: 0.9900

You can modify options using dot notation. For example, set the agent sample time to 0.5.

opt.SampleTime = 0.5;

## Input Arguments

### Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

Example: 'ExperienceHorizon',256

Number of steps the agent interacts with the environment before learning from its experience, specified as the comma-separated pair consisting of 'ExperienceHorizon' and a positive integer.

The ExperienceHorizon value must be greater than or equal to the MiniBatchSize value.

Clip factor for limiting the change in each policy update step, specified as the comma-separated pair consisting of 'ClipFactor' and a positive scalar less than 1.
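For orientation, the clipped surrogate objective from the standard PPO formulation is sketched below. This page does not state the exact objective the toolbox uses, so treat this as an assumption. The symbols $r_t$ (the ratio of the action probability under the updated policy to that under the previous policy), $D_t$ (the advantage estimate), and $\epsilon$ (the clip factor) are introduced here for illustration.

$L^{clip}=\hat{\mathbb{E}}_t\left[\min\left(r_t D_t,\ \mathrm{clip}\left(r_t,1-\epsilon,1+\epsilon\right)D_t\right)\right]$

With the default clip factor of 0.2, the ratio $r_t$ stops contributing additional gain once it moves outside the interval [0.8, 1.2], which is how a single update is kept from changing the policy too much.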

Entropy loss weight, specified as the comma-separated pair consisting of 'EntropyLossWeight' and a scalar value between 0 and 1. A higher loss weight value promotes agent exploration by applying a penalty for being too certain about which action to take. Doing so can help the agent move out of local optima.

For episode step t, the entropy loss function, which is added to the loss function for actor updates, is:

$H_t = E\sum_{k=1}^{M}\mu_k\left(S_t|\theta_\mu\right)\ln\mu_k\left(S_t|\theta_\mu\right)$

Here:

• E is the entropy loss weight.

• M is the number of possible actions.

• $\mu_k(S_t|\theta_\mu)$ is the probability of taking action $A_k$ when in state $S_t$ following the current policy.
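To make the formula concrete, the following sketch evaluates the entropy loss term for a single state under two hypothetical policies with three possible actions. The probability vectors are made-up values, and the code only evaluates the expression above; it is not the toolbox's internal implementation.

```matlab
% Sketch: evaluate the entropy loss term for one state and two example policies.
E = 0.01;                      % entropy loss weight (EntropyLossWeight)

muPeaked  = [0.7 0.2 0.1];     % hypothetical action probabilities, M = 3
muUniform = [1 1 1] / 3;       % maximally uncertain policy

% H_t = E * sum_k mu_k * ln(mu_k); log is the natural logarithm in MATLAB.
HPeaked  = E * sum(muPeaked  .* log(muPeaked));
HUniform = E * sum(muUniform .* log(muUniform));

fprintf('Peaked policy:  H = %.4f\n', HPeaked);   % prints -0.0080
fprintf('Uniform policy: H = %.4f\n', HUniform);  % prints -0.0110
```

The uniform policy yields the smaller (more negative) loss, so minimizing this term pushes the policy toward less certain action choices. Note that a probability of exactly 0 would produce NaN in this naive expression, because 0*log(0) evaluates to NaN in MATLAB.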

Mini-batch size used for each learning epoch, specified as the comma-separated pair consisting of 'MiniBatchSize' and a positive integer.

The MiniBatchSize value must be less than or equal to the ExperienceHorizon value.

Number of epochs for which the actor and critic networks learn from the current experience set, specified as the comma-separated pair consisting of 'NumEpoch' and a positive integer.

Method for estimating advantage values, specified as the comma-separated pair consisting of 'AdvantageEstimateMethod' and one of the following:

• "gae" — Generalized advantage estimator

• "finite-horizon" — Finite horizon estimation

For more information on these methods, see the training algorithm information in Proximal Policy Optimization Agents.

Smoothing factor for generalized advantage estimator, specified as the comma-separated pair consisting of 'GAEFactor' and a scalar value between 0 and 1, inclusive. This option applies only when the AdvantageEstimateMethod option is "gae".
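The role of the smoothing factor can be illustrated with the textbook form of generalized advantage estimation, in which each advantage is a sum of temporal-difference errors discounted by the product of the discount factor and the GAE factor. The sketch below assumes that form; the reward and critic values are hypothetical, and the variable names are chosen for illustration rather than taken from the toolbox.

```matlab
% Sketch of generalized advantage estimation over one experience horizon.
gamma  = 0.99;                 % DiscountFactor
lambda = 0.95;                 % GAEFactor
r = [1 0 0 1];                 % hypothetical rewards for steps 1..N
V = [0.5 0.4 0.3 0.2 0];       % hypothetical critic values for states 1..N+1

N = numel(r);
delta = r + gamma * V(2:end) - V(1:end-1);   % temporal-difference errors

D = zeros(1,N);                % advantage estimates
acc = 0;
for t = N:-1:1
    acc = delta(t) + gamma * lambda * acc;   % discounted sum of later TD errors
    D(t) = acc;
end
disp(D)
```

In this form, a GAE factor of 0 reduces each advantage to the one-step temporal-difference error, while a factor of 1 sums all remaining discounted errors up to the end of the horizon.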

Sample time of agent, specified as the comma-separated pair consisting of 'SampleTime' and a positive scalar.

Discount factor applied to future rewards during training, specified as the comma-separated pair consisting of 'DiscountFactor' and a positive scalar less than or equal to 1.
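As a small worked example of what the discount factor does, the sketch below computes the discounted sum of a short, hypothetical reward sequence. It assumes the usual definition in which the reward received k steps in the future is weighted by the discount factor raised to the power k.

```matlab
% Sketch: discounted return for a hypothetical reward sequence.
gamma = 0.99;                  % DiscountFactor
r = [1 1 1 1];                 % made-up rewards for four consecutive steps

k = 0:numel(r)-1;              % number of steps into the future
G = sum(gamma.^k .* r);        % 1 + 0.99 + 0.99^2 + 0.99^3

fprintf('Discounted return: %.4f\n', G);   % prints 3.9404
```

Values closer to 1 make the agent weigh long-term rewards more heavily, while smaller values emphasize immediate rewards.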

## Output Arguments

PPO agent options, returned as an rlPPOAgentOptions object. The object properties are described in Name-Value Pair Arguments.