Reinforcement Learning: A Brief Guide
By Emmanouil Tzorakoleftherakis, MathWorks
Reinforcement learning has the potential to solve tough decision-making problems in many applications, including industrial automation, autonomous driving, video game playing, and robotics.
Reinforcement learning is a type of machine learning in which a computer learns to perform a task through repeated interactions with a dynamic environment. This trial-and-error learning approach enables the computer to make a series of decisions without human intervention and without being explicitly programmed to perform the task. One famous example of reinforcement learning in action is AlphaGo, the first computer program to defeat a world champion at the game of Go.
Reinforcement learning works with data from a dynamic environment—in other words, with data that changes based on external conditions, such as weather or traffic flow. The goal of a reinforcement learning algorithm is to find a strategy that will generate the optimal outcome. The way reinforcement learning achieves this goal is by allowing a piece of software called an agent to explore, interact with, and learn from the environment.
An Automated Driving Example
One important aspect of automated driving is self-parking. The goal is for the vehicle computer (agent) to position the car in the correct parking spot and with the correct orientation. In this example, the environment is everything outside the agent—such as the dynamics of the vehicle, nearby vehicles, weather conditions, and so on. During training, the agent uses readings from cameras, GPS, lidar, and other sensors to generate steering, braking, and acceleration commands (actions). To learn how to generate the correct actions from the observations (policy tuning), the agent repeatedly tries to park the vehicle using trial and error. The correct action is rewarded (reinforced) with a numerical signal (Figure 1).
In this example, training is overseen by a training algorithm. The training algorithm is responsible for tuning the agent’s policy based on the collected sensor readings, actions, and rewards. After training, the vehicle’s computer should be able to park using only the tuned policy and the sensor readings.
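The interaction described above can be sketched in a few lines of Python. This is only an illustrative toy, not the self-parking system itself: the ParkingEnv class, its one-dimensional strip of cells, the reward values, and the two example policies are all assumptions made for this sketch. It shows the signals that a training algorithm would collect: observations, actions, and rewards.

```python
import random


class ParkingEnv:
    """Hypothetical toy environment: a car on a 1-D strip of cells."""

    def __init__(self, length=10, spot=7):
        self.length = length   # cells are numbered 0 .. length
        self.spot = spot       # index of the target parking spot
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position   # the observation is simply the position

    def step(self, action):
        # action is -1 (reverse), 0 (hold), or +1 (forward)
        self.position = max(0, min(self.length, self.position + action))
        parked = self.position == self.spot
        reward = 1.0 if parked else -0.1   # numerical reward signal
        return self.position, reward, parked


def run_episode(env, policy, max_steps=50):
    """Run one episode and return the (observation, action, reward) log."""
    experience = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)                     # policy maps observation to action
        next_obs, reward, done = env.step(action)
        experience.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return experience


def random_policy(obs):
    """Untrained behavior: pure trial and error."""
    return random.choice([-1, 0, 1])


def always_forward(obs):
    """A fixed policy that happens to reach the spot from cell 0."""
    return 1


if __name__ == "__main__":
    log = run_episode(ParkingEnv(), always_forward)
    total = round(sum(r for _, _, r in log), 2)
    print(len(log), "steps, total reward", total)  # prints: 7 steps, total reward 0.4
```

A training algorithm would use logs like the one returned by run_episode to adjust the policy; here the policies are fixed, so nothing is learned yet.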
Algorithms for Reinforcement Learning
Many reinforcement learning training algorithms have been developed to date. Some of the most popular algorithms rely on deep neural networks. The biggest advantage of neural networks is that they can encode complex behaviors, making it possible to use reinforcement learning in applications that would be very challenging to tackle with traditional algorithms.
For example, in autonomous driving, a neural network can replace the driver and decide how to turn the steering wheel by simultaneously looking at input from multiple sensors, such as camera frames and lidar measurements (Figure 2). Without neural networks, the problem would be broken down into smaller pieces: a module that analyzes the camera input to identify useful features, another module that filters the lidar measurements, possibly one component that would aim to paint the full picture of the vehicle’s surroundings by fusing the sensor outputs, a “driver” module, and so on.
Reinforcement Learning Workflow
Training an agent using reinforcement learning involves five steps:
- Create the environment. Define the environment within which the agent can learn, including the interface between agent and environment. The environment can be either a simulation model or a real physical system. Simulated environments are usually a good first step since they are safer and allow experimentation.
- Define the reward. Specify the reward signal that the agent uses to measure its performance against the task goals and how this signal is calculated from the environment. Reward shaping may require a few iterations to get right.
- Create the agent. The agent consists of the policy and the training algorithm, so you need to:
  - Choose a way to represent the policy (for example, using neural networks or lookup tables). Consider how you want to structure the parameters and logic that make up the decision-making part of the agent.
  - Select the appropriate training algorithm. Most modern reinforcement learning algorithms rely on neural networks because they are good candidates for large state/action spaces and complex problems.
- Train and validate the agent. Set up training options (such as stopping criteria) and train the agent to tune the policy. The easiest way to validate a trained policy is through simulation.
- Deploy the policy. Deploy the trained policy representation using, for example, generated C/C++ or CUDA code. No need to worry about agents and training algorithms at this point—the policy is a standalone decision-making system.
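As a rough illustration of the five steps, the sketch below walks through them on a deliberately tiny problem. Everything in it is an assumption chosen for the example: the eight-cell corridor, the reward values, the lookup-table policy, and the use of tabular Q-learning as the training algorithm. A real application would typically use a richer environment and often a neural network policy.

```python
import random

# Step 1: create the environment (a simulated corridor of cells 0 .. N_CELLS - 1).
N_CELLS = 8
SPOT = 5               # hypothetical target parking cell
ACTIONS = (-1, +1)     # reverse, forward


def step(state, action):
    next_state = max(0, min(N_CELLS - 1, state + action))
    done = next_state == SPOT
    # Step 2: define the reward (bonus for parking, small cost per move).
    reward = 10.0 if done else -1.0
    return next_state, reward, done


# Step 3: create the agent (lookup-table policy tuned by tabular Q-learning).
def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_CELLS)]   # one value per (state, action)
    for _ in range(episodes):
        state = rng.randrange(N_CELLS)
        for _ in range(100):                   # stopping criterion per episode
            if rng.random() < epsilon:         # explore
                a = rng.randrange(2)
            else:                              # exploit the current estimate
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q


# Step 5: deploy only the policy; the training algorithm is no longer needed.
def extract_policy(q):
    return [ACTIONS[0] if row[0] > row[1] else ACTIONS[1] for row in q]


# Step 4: validate the trained policy in simulation.
def validate(policy, start, max_steps=20):
    state = start
    for steps in range(1, max_steps + 1):
        state, _, done = step(state, policy[state])
        if done:
            return steps       # number of moves needed to park
    return None                # policy failed to park within max_steps


if __name__ == "__main__":
    policy = extract_policy(train())
    print("policy:", policy)
    print("moves to park from cell 0:", validate(policy, 0))
```

The list returned by extract_policy is the standalone decision-making system mentioned in the last step: it can be used without the Q-table update logic that produced it.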
An Iterative Process
Training an agent using reinforcement learning involves a fair amount of trial and error. Decisions and results in later stages can require you to return to an earlier stage in the learning workflow. For example, if the training process does not converge to an optimal policy within a reasonable amount of time, you may have to update any of the following before retraining the agent:
- Training settings
- Learning algorithm configuration
- Policy representation
- Reward signal definition
- Action and observation signals
- Environment dynamics
When Is Reinforcement Learning the Right Approach?
While reinforcement learning is a major advance in machine learning, it is not always the best approach. Here are three issues to bear in mind if you are considering trying it:
- It is not sample-efficient. This means that a lot of training is required to reach acceptable performance. Even for relatively simple applications, training can take anywhere from minutes to hours or days. AlphaGo was trained by playing millions of games nonstop for several days, accumulating thousands of years’ worth of human knowledge.
- Setting up the problem correctly can be tricky; many design decisions need to be made, which may require several iterations to get right. These decisions include selecting the appropriate architecture for the neural network, tuning hyperparameters, and shaping the reward signal.
- A trained deep neural network policy is a “black box,” meaning that the internal structure of the network is so complex (often consisting of millions of parameters) that it is almost impossible to understand, explain, and evaluate the decisions taken. This makes it difficult to establish formal performance guarantees with neural network policies.
If you are working on a time- or safety-critical project, you might want to try an alternative method first. For example, for control design, using a traditional control method would be a good way to start.
Real-World Example: Robot Teaches Itself to Walk
Researchers from the University of Southern California’s Valero Lab built a simple robotic leg that taught itself how to move in just minutes using a reinforcement learning algorithm written in MATLAB® (Figure 3).
The three-tendon, two-joint limb learns autonomously, first by modeling its own dynamic properties and then by using reinforcement learning.
For the physical design, this robotic leg used a tendon architecture, much like the muscle and tendon structure that powers animals’ movements. Reinforcement learning then used the understanding of the dynamics to accomplish the goal of walking on a treadmill.
Reinforcement Learning and “Motor Babbling”
By combining motor babbling with reinforcement learning, the system attempts random motions and learns properties of its dynamics through the results of these motions. For this research, the team began by letting the system play at random, or motor babble. The researchers then gave the system a reward each time it performed the given task correctly, which in this case meant moving the treadmill forward.
The resulting algorithm, called G2P (general to particular), replicates the general problem that biological nervous systems face when controlling limbs by learning from the movement that occurs when a tendon moves the limb (Figure 4). It then reinforces (rewards) the behavior that is particular to the task, which in this case is successfully moving the treadmill. In other words, the system builds a general understanding of its dynamics through motor babbling and then masters a desired “particular” task by learning from every experience, hence the name G2P.
The neural network, built with MATLAB and Deep Learning Toolbox™, uses the results from the motor babbling to create an inverse map between inputs (movement kinematics) and outputs (motor activations). The network updates the model based on each attempt made during the reinforcement learning phase to home in on the desired results. It remembers the best result each time, and if a new input creates a better result, it overwrites the model with the new settings.
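The two-stage idea can be illustrated with a heavily simplified Python sketch. It is not the G2P algorithm or the researchers' MATLAB code. The one-dimensional plant function is a made-up stand-in for the limb, a least-squares line replaces the neural network as the inverse map, and the refinement stage is plain keep-the-best random search; every name and number here is hypothetical.

```python
import random


def plant(activation):
    """Hypothetical limb: maps a motor activation in [0, 1] to a joint angle."""
    return activation ** 2 + activation


def babble(rng, n_samples=50):
    """Motor babbling: try random activations and record the resulting motion."""
    activations = [rng.random() for _ in range(n_samples)]
    return [(plant(u), u) for u in activations]   # (kinematics, activation) pairs


def fit_inverse_map(samples):
    """Least-squares line u = a * y + b (a stand-in for the neural network)."""
    n = len(samples)
    mean_y = sum(y for y, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    cov = sum((y - mean_y) * (u - mean_u) for y, u in samples)
    var = sum((y - mean_y) ** 2 for y, _ in samples)
    a = cov / var
    b = mean_u - a * mean_y
    return a, b


def refine(target, first_guess, rng, attempts=200, noise=0.05):
    """Remember the best activation so far; overwrite it only on improvement."""
    best_u = first_guess
    best_reward = -abs(plant(best_u) - target)    # reward: closeness to the target
    for _ in range(attempts):
        candidate = min(1.0, max(0.0, best_u + rng.gauss(0.0, noise)))
        reward = -abs(plant(candidate) - target)
        if reward > best_reward:                  # better result: overwrite
            best_u, best_reward = candidate, reward
    return best_u, best_reward


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = fit_inverse_map(babble(rng))           # general: learn own dynamics
    target = 1.0                                  # desired joint angle
    guess = a * target + b                        # inverse map's first attempt
    best_u, best_reward = refine(target, guess, rng)   # particular: task tuning
    print("error after babbling:  ", abs(plant(guess) - target))
    print("error after refinement:", -best_reward)
```

The babbling stage gives a usable but imperfect first guess because the fitted line only approximates the nonlinear plant; the refinement stage then improves on it for one particular target.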
The G2P algorithm can learn a new walking task by itself after only 5 minutes of unstructured play. It can then adapt to other tasks without any additional programming.