The goal of reinforcement learning is to train an agent to complete a task within an uncertain environment. The agent receives observations and a reward from the environment and sends actions to the environment. The reward is a measure of how successful an action is with respect to completing the task goal.
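The interaction described above can be sketched as a simple loop. This is a minimal illustration in plain Python, not toolbox code: the `LineWorld` environment, its `reset` and `step` methods, and the `run_episode` helper are all hypothetical names introduced here for the example.

```python
class LineWorld:
    """Hypothetical toy environment: walk along positions 0..4 to reach 4."""

    def __init__(self, size=5):
        self.size = size
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (-1 or +1); return (observation, reward, done)."""
        self.position = min(max(self.position + action, 0), self.size - 1)
        done = self.position == self.size - 1
        reward = 1 if done else 0  # reward is given only when the goal is reached
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward the agent collected."""
    observation = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = policy(observation)                  # agent sends an action
        observation, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(observation):
    """Trivial fixed policy used only to exercise the loop."""
    return 1
```

Calling `run_episode(LineWorld(), always_right)` walks the agent to the goal and returns the reward collected along the way.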
The agent contains two components: a policy and a learning algorithm.
The policy is a mapping that selects actions based on the observations from the environment. Typically, the policy is a function approximator with tunable parameters, such as a deep neural network.
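To make the idea of a parameterized policy concrete, the sketch below uses a linear softmax policy instead of a deep neural network, purely to keep the example short. The function names and the layout of `parameters` (one weight vector per action) are assumptions made for this illustration.

```python
import math
import random


def action_probabilities(parameters, observation):
    """Softmax over one linear score per action.

    parameters: one weight vector per action (the tunable part of the policy).
    observation: list of floats with the same length as each weight vector.
    """
    scores = [sum(w * x for w, x in zip(weights, observation))
              for weights in parameters]
    top = max(scores)  # subtract the maximum score for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(parameters, observation, rng=random):
    """Sample an action index according to the policy's probabilities."""
    probs = action_probabilities(parameters, observation)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Changing the values in `parameters` changes which actions the policy prefers for a given observation, which is what a learning algorithm adjusts during training.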
The learning algorithm continuously updates the policy parameters based on the actions, observations, and rewards. The goal of the learning algorithm is to find an optimal policy that maximizes the cumulative reward received during the task.
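As one example of such an update, the sketch below shows a single step of tabular Q-learning. It is only one of many possible learning algorithms, and the table layout, the `learning_rate` and `discount` values, and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict


def q_learning_update(q_table, observation, action, reward, next_observation,
                      actions, learning_rate=0.1, discount=0.9, done=False):
    """Move Q(observation, action) toward the one-step bootstrapped target."""
    if done:
        target = reward  # no future reward after the episode ends
    else:
        best_next = max(q_table[(next_observation, a)] for a in actions)
        target = reward + discount * best_next
    td_error = target - q_table[(observation, action)]
    q_table[(observation, action)] += learning_rate * td_error
    return q_table[(observation, action)]


# Unseen (observation, action) pairs start with an estimate of 0.
q_table = defaultdict(float)
```

Repeating this update over many experienced transitions gradually improves the value estimates, and a policy that picks the highest-valued action improves with them.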
Depending on the learning algorithm, an agent maintains one or more parameterized function approximators for training the policy. There are two types of function approximators.
Critics: For a given observation and action, a critic estimates the expected value of the long-term future reward for the task.
Actors: For a given observation, an actor selects the action that is expected to maximize the long-term future reward.
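The difference between the two approximator types is easiest to see in their input and output signatures. The sketch below is a simplification: the critic is a lookup table rather than a trained approximator, and the actor is derived greedily from the critic rather than having parameters of its own. The names `make_critic` and `make_greedy_actor` are hypothetical.

```python
def make_critic(values):
    """Critic: (observation, action) -> estimated long-term reward."""
    def critic(observation, action):
        return values.get((observation, action), 0.0)
    return critic


def make_greedy_actor(critic, actions):
    """Actor: observation -> action with the highest critic estimate."""
    def actor(observation):
        return max(actions, key=lambda action: critic(observation, action))
    return actor


# Example estimates for a single observation "s0" (made-up numbers).
critic = make_critic({("s0", "left"): 0.2, ("s0", "right"): 0.7})
actor = make_greedy_actor(critic, ["left", "right"])
```

Here `critic("s0", "right")` returns a value estimate, while `actor("s0")` returns an action.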
For more information on creating actor and critic function approximators, see Create Policy and Value Function Representations.
Reinforcement Learning Toolbox™ software provides the following built-in agents. Each agent can be trained in environments with continuous or discrete observation spaces and the following action spaces.
You can also train policies using other learning algorithms by creating a custom agent. To do so, you create a subclass of a custom agent class, defining the agent behavior using a set of required and optional methods. For more information, see Custom Agents.