Quadruped Robot Locomotion Using DDPG Agent

This example shows how to train a quadruped robot, modeled using Simscape Multibody, to walk using a deep deterministic policy gradient (DDPG) agent. For more information on DDPG agents, see Deep Deterministic Policy Gradient Agents.

Load the necessary parameters into the base workspace in MATLAB®.
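This step defines workspace variables that the model and scripts use later, such as the sample time Ts and the final simulation time Tf. A minimal sketch, assuming the example ships a helper script (the name initializeRobotParameters is an assumption for illustration) that defines these quantities:

% Define Ts, Tf, torque scaling, and other model parameters in the base
% workspace. The script name is illustrative.
initializeRobotParameters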


Quadruped Robot Model

The environment for this example is a quadruped robot, and the training goal is to make the robot walk in a straight line using minimal control effort.

The robot is modeled using Simscape Multibody and the Simscape Multibody Contact Forces Library. The main structural components are four legs and a torso. The legs are connected to the torso through revolute joints. Action values provided by the RL Agent block are scaled and converted into joint torque values, which the revolute joints use to compute motion.

Open the model.

mdl = 'rlQuadrupedRobot';
open_system(mdl)


The robot environment provides 44 observations to the agent, each normalized between -1 and 1. These observations are:

  • Y (vertical) and Z (lateral) position of the torso center of mass

  • Quaternion representing the orientation of the torso

  • X (forward), Y (vertical), and Z (lateral) velocities of the torso at the center of mass

  • Roll, pitch, and yaw rates of the torso

  • Angular positions and velocities of the hip and knee joints for each leg

  • Normal and friction force due to ground contact for each leg

  • Action values (torque for each joint) from the previous time step

For all four legs, the initial values of the hip and knee joint angles are set to -0.8234 and 1.6468 radians, respectively. The neutral position of each joint is 0 radians, which occurs when the leg is stretched to its maximum extent and aligned perpendicular to the ground.


The agent generates eight actions normalized between -1 and 1. After multiplication by a scaling factor, these correspond to the eight joint torque signals for the revolute joints. The torque bound is +/- 10 Nm for each joint.
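As an illustration of this scaling, consider the following minimal sketch. The conversion actually happens inside the Simulink model; maxTorque and a are illustrative names, not signals from the model.

% Map normalized actions in [-1,1] to joint torques in [-10,10] Nm.
maxTorque = 10;       % torque bound per joint, in Nm
tau = maxTorque*a;    % a is the 8-by-1 normalized action vector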


The following reward function is provided to the agent at each time step during training. This reward function encourages the agent to move forward by providing a positive reward for positive forward velocity. It also encourages the agent to avoid early termination by providing a constant reward (25 Ts/Tf) at each time step. The remaining terms in the reward function are penalties that discourage unwanted states, such as large deviations from the desired height and orientation or use of excessive joint torques. A code sketch of this computation follows the list of terms below.

$r_t = v_x + 25\frac{T_s}{T_f} - 50\hat{y}^2 - 20\theta^2 - 0.02\sum_i \left(u_{t-1}^i\right)^2$


  • $v_x$ is the velocity of the torso center of mass in the x-direction.

  • $T_s$ and $T_f$ are the sample time and final simulation time of the environment, respectively.

  • $\hat{y}$ is the scaled height error of the torso center of mass from the desired height of 0.75 m.

  • $\theta$ is the pitch angle of the torso.

  • $u_{t-1}^i$ is the action value for joint $i$ from the previous time step.
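The following MATLAB sketch shows one way this reward could be computed. The function and argument names are illustrative rather than the exact signals used in the Simulink model, and the height-error scaling is an assumption.

% Illustrative reward computation for a single time step.
function r = computeReward(vx,y,theta,uPrev,Ts,Tf)
    yDesired = 0.75;                  % desired torso height, in m
    yHat = (y - yDesired)/yDesired;   % scaled height error (scaling assumed)
    r = vx ...                        % reward forward velocity
        + 25*Ts/Tf ...                % constant reward for avoiding termination
        - 50*yHat^2 ...               % penalize height deviation
        - 20*theta^2 ...              % penalize pitch deviation
        - 0.02*sum(uPrev.^2);         % penalize control effort
end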

Episode Termination

During training or simulation, the episode terminates if any of the following conditions occur (a code sketch of this logic follows the list):

  • The height of the torso center of mass from the ground is below 0.5 m (fallen)

  • The head or tail of the torso is below the ground

  • Any knee joint is below the ground

  • The roll, pitch, or yaw angle of the torso is outside bounds (+/- 0.1745, +/- 0.1745, and +/- 0.3491 radians, respectively).
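In code form, the termination check could look like the following sketch. The signal names are illustrative; in the model these quantities come from the Simscape sensing blocks.

% Illustrative termination logic evaluated at each time step.
isDone = torsoHeight < 0.5 ...                  % torso has fallen
    || headHeight < 0 || tailHeight < 0 ...     % head or tail below ground
    || any(kneeHeight < 0) ...                  % any knee below ground
    || abs(roll) > 0.1745 ...                   % roll out of bounds
    || abs(pitch) > 0.1745 ...                  % pitch out of bounds
    || abs(yaw) > 0.3491;                       % yaw out of bounds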

Create Environment Interface

Specify the parameters for the observation set.

numObs = 44;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';

Specify the parameters for the action set.

numAct = 8;
actInfo = rlNumericSpec([numAct 1],'LowerLimit',-1,'UpperLimit', 1);
actInfo.Name = 'torque';

Create the environment interface for the Simulink model.

blk = [mdl, '/RL Agent'];
env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);

During training, the reset function introduces random deviations into the initial joint angles and angular velocities.

env.ResetFcn = @quadrupedResetFcn;
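quadrupedResetFcn is provided with the example. As a rough sketch, such a reset function can perturb the nominal initial conditions through the Simulink.SimulationInput object; the variable names below are assumptions for illustration.

% Illustrative reset function: randomize initial joint states.
function in = quadrupedResetFcnSketch(in)
    % Perturb initial angles around the nominal -0.8234 (hip) and 1.6468 (knee) rad.
    in = setVariable(in,'initHipAngle', -0.8234 + 0.1*(2*rand - 1));
    in = setVariable(in,'initKneeAngle', 1.6468 + 0.1*(2*rand - 1));
    % Perturb initial angular velocities around zero.
    in = setVariable(in,'initHipVelocity', 0.1*(2*rand - 1));
    in = setVariable(in,'initKneeVelocity',0.1*(2*rand - 1));
end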

Create DDPG Agent

The DDPG agent approximates the long-term reward given observations and actions using a critic value function representation. The agent also decides which action to take given the observations, using an actor representation. The actor and critic networks for this example are inspired by [2].

For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations. For an example that creates neural networks for DDPG agents, see Train DDPG Agent to Control Double Integrator System.

Create the networks in the MATLAB workspace using the createNetworks helper function.
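% Create the critic and actor networks (and the actor and critic
% variables used below when constructing the agent) in the MATLAB workspace.
createNetworks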


You can also create your actor and critic networks interactively using the Deep Network Designer app.

View the critic network configuration.
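Assuming the helper script stores the critic layer graph in a variable named criticNetwork (an assumption for illustration), you can visualize it with plot.

% Plot the critic network layer graph.
plot(criticNetwork)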


Specify the agent options using rlDDPGAgentOptions.

agentOptions = rlDDPGAgentOptions;
agentOptions.SampleTime = Ts;
agentOptions.DiscountFactor = 0.99;
agentOptions.MiniBatchSize = 250;
agentOptions.ExperienceBufferLength = 1e6;
agentOptions.TargetSmoothFactor = 1e-3;
agentOptions.NoiseOptions.MeanAttractionConstant = 0.15;
agentOptions.NoiseOptions.Variance = 0.1;

Create the rlDDPGAgent object for the agent.

agent = rlDDPGAgent(actor,critic,agentOptions);

Specify Training Options

To train the agent, first specify the following training options:

  • Run the training for at most 10000 episodes, with each episode lasting at most maxSteps time steps.

  • Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command line display (set the Verbose option).

  • Stop training when the agent receives an average cumulative reward greater than 190 over 250 consecutive episodes.

  • Save a copy of the agent for each episode where the cumulative reward is greater than 200.

maxEpisodes = 10000;
maxSteps = floor(Tf/Ts);  
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',maxEpisodes,'MaxStepsPerEpisode',maxSteps,...
    'ScoreAveragingWindowLength',250,'Plots','training-progress','Verbose',false,...
    'StopTrainingCriteria','AverageReward','StopTrainingValue',190,...
    'SaveAgentCriteria','EpisodeReward','SaveAgentValue',200);

Specify the following training options to train the agent in parallel training mode. If you do not have Parallel Computing Toolbox™ software installed, set UseParallel to false.

  • Set the UseParallel option to true.

  • Train the agent in parallel asynchronously.

  • After every 32 steps, each worker sends experiences to the host.

  • DDPG agents require workers to send 'Experiences' to the host.

trainOpts.UseParallel = true;                    
trainOpts.ParallelizationOptions.Mode = 'async';
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = 32;
trainOpts.ParallelizationOptions.DataToSendFromWorkers = 'Experiences';

Train Agent

Train the agent using the train function. Due to the complexity of the robot model, this process is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true. Due to the randomness of parallel training, you can expect your training results to differ from run to run.

doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load a pretrained agent for the example.
    load('rlQuadrupedAgent.mat','agent')
end


Simulate Trained Agent

Fix the random generator seed for reproducibility.
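% Fix the random number generator seed (the seed value 0 is an arbitrary choice).
rng(0)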


To validate the performance of the trained agent, simulate it within the robot environment. For more information on agent simulation, see rlSimulationOptions and sim.

simOptions = rlSimulationOptions('MaxSteps',maxSteps);
experience = sim(env,agent,simOptions);


References

[1] Heess, N., et al. "Emergence of Locomotion Behaviours in Rich Environments." Technical report, arXiv, 2017.

[2] Lillicrap, T. P., et al. "Continuous Control with Deep Reinforcement Learning." International Conference on Learning Representations, 2016.
