Train DDPG Agent for Path-Following Control
This example shows how to train a deep deterministic policy gradient (DDPG) agent for path-following control (PFC) in Simulink®. For more information on DDPG agents, see Deep Deterministic Policy Gradient (DDPG) Agents (Reinforcement Learning Toolbox).
The reinforcement learning environment for this example is a simple bicycle model for the ego car and a simple longitudinal model for the lead car. The training goal is to make the ego car travel at a set velocity while maintaining a safe distance from lead car by controlling longitudinal acceleration and braking, and also while keeping the ego car travelling along the centerline of its lane by controlling the front steering angle. For more information on PFC, see Path Following Control System (Model Predictive Control Toolbox). The ego car dynamics are specified by the following parameters.
m = 1600;    % total vehicle mass (kg)
Iz = 2875;   % yaw moment of inertia (mNs^2)
lf = 1.4;    % longitudinal distance from center of gravity to front tires (m)
lr = 1.6;    % longitudinal distance from center of gravity to rear tires (m)
Cf = 19000;  % cornering stiffness of front tires (N/rad)
Cr = 33000;  % cornering stiffness of rear tires (N/rad)
tau = 0.5;   % longitudinal time constant
Specify the initial position and velocity for the two vehicles.
x0_lead = 50;  % initial position for lead car (m)
v0_lead = 24;  % initial velocity for lead car (m/s)
x0_ego = 10;   % initial position for ego car (m)
v0_ego = 18;   % initial velocity for ego car (m/s)
Specify the standstill default spacing (m), time gap (s), and driver-set velocity (m/s).
D_default = 10;
t_gap = 1.4;
v_set = 28;
To simulate the physical limitations of the vehicle dynamics, constrain the acceleration to the range [–3,2] (m/s^2) and the steering angle to the range [–0.2618,0.2618] (rad), that is, –15 to 15 degrees.
amin_ego = -3;
amax_ego = 2;
umin_ego = -0.2618;  % -15 deg
umax_ego = 0.2618;   % +15 deg
The curvature of the road is defined by a constant 0.001 (m^-1). The initial value for the lateral deviation is 0.2 m and the initial value for the relative yaw angle is –0.1 rad.
rho = 0.001;
e1_initial = 0.2;
e2_initial = -0.1;
Define the sample time Ts and simulation duration Tf in seconds.
Ts = 0.1;
Tf = 60;
Open the model.
mdl = 'rlPFCMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];
For this model:
The action signal consists of acceleration and steering angle actions. The acceleration action signal takes values between –3 and 2 (m/s^2). The steering action signal takes values between –15 degrees (–0.2618 rad) and 15 degrees (0.2618 rad).
The reference velocity for the ego car is defined as follows. If the relative distance is less than the safe distance, the ego car tracks the minimum of the lead car velocity and driver-set velocity. In this manner, the ego car maintains some distance from the lead car. If the relative distance is greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe distance is defined as a linear function of the ego car longitudinal velocity V, that is, D_safe = t_gap*V + D_default. The safe distance determines the tracking velocity for the ego car.
The observations from the environment contain the longitudinal measurements: the velocity error eV = Vref − Vego, its integral ∫eV, and the ego car longitudinal velocity Vego. In addition, the observations contain the lateral measurements: the lateral deviation e1, relative yaw angle e2, their derivatives ė1 and ė2, and their integrals ∫e1 and ∫e2.
The simulation terminates when the lateral deviation |e1| > 1, when the longitudinal velocity Vego < 0.5, or when the relative distance between the lead car and ego car Drel < 0.
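As a concrete illustration, the reference-velocity logic described above can be sketched in MATLAB. This is only an illustrative sketch: in the example the computation happens inside the Simulink model, and the variable names v_lead, v_ego, D_rel, and v_ref are hypothetical.

```matlab
% Illustrative sketch of the reference-velocity selection (hypothetical
% variable names; the actual logic lives inside the Simulink model).
v_lead = 24;                       % current lead car velocity (m/s)
v_ego  = 26;                       % current ego car velocity (m/s)
D_rel  = 40;                       % current relative distance (m)
D_safe = t_gap*v_ego + D_default;  % safe distance, linear in ego velocity
if D_rel < D_safe
    v_ref = min(v_lead,v_set);     % keep distance: track the slower velocity
else
    v_ref = v_set;                 % track the driver-set velocity
end
```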
The reward r_t, provided at every time step t, is

r_t = -(100*e1^2 + 500*u_{t-1}^2 + 10*eV^2 + 100*a_{t-1}^2)*1e-3 - 10*F_t + 2*H_t + M_t

where u_{t-1} is the steering input from the previous time step t-1 and a_{t-1} is the acceleration input from the previous time step. The three logical values are as follows.

F_t = 1 if the simulation is terminated, otherwise F_t = 0
H_t = 1 if the lateral error e1^2 <= 0.01, otherwise H_t = 0
M_t = 1 if the velocity error eV^2 <= 1, otherwise M_t = 0
The three logical terms in the reward encourage the agent to make both the lateral error and the velocity error small, and at the same time penalize the agent if the simulation terminates early.
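For reference, the reward signal can be sketched as a MATLAB function. This is a hypothetical helper for illustration only; in the example the reward is computed inside the Simulink model.

```matlab
% Illustrative sketch of the reward signal (hypothetical function;
% the reward is actually computed inside the Simulink model).
function r = pfcReward(e1,eV,u_prev,a_prev,isDone)
F = double(isDone);        % 1 if the simulation terminated early
H = double(e1^2 <= 0.01);  % 1 if the lateral error is small
M = double(eV^2 <= 1);     % 1 if the velocity error is small
r = -(100*e1^2 + 500*u_prev^2 + 10*eV^2 + 100*a_prev^2)*1e-3 ...
    - 10*F + 2*H + M;
end
```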
Create Environment Interface
Create an environment interface for the Simulink model.
Create the observation specification.
observationInfo = rlNumericSpec([9 1],'LowerLimit',-inf*ones(9,1),'UpperLimit',inf*ones(9,1));
observationInfo.Name = 'observations';
Create the action specification.
actionInfo = rlNumericSpec([2 1],'LowerLimit',[-3;-0.2618],'UpperLimit',[2;0.2618]);
actionInfo.Name = 'accel;steer';
Create the environment interface.
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);
To define the initial conditions, specify an environment reset function using an anonymous function handle. The reset function
localResetFcn, which is defined at the end of the example, randomizes the initial position of the lead car, the lateral deviation, and the relative yaw angle.
env.ResetFcn = @(in)localResetFcn(in);
Fix the random generator seed for reproducibility.
rng(0)
Create DDPG Agent
A DDPG agent approximates the long-term reward given observations and actions by using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the state and action, and one output. For more information on creating a deep neural network value function representation, see Create Policies and Value Functions (Reinforcement Learning Toolbox).
L = 100; % number of neurons
statePath = [
    featureInputLayer(9,'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    additionLayer(2,'Name','add')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')];
actionPath = [
    featureInputLayer(2,'Normalization','none','Name','action')
    fullyConnectedLayer(L,'Name','fc5')];
criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');
criticNetwork = dlnetwork(criticNetwork);
View the critic network configuration.
Specify options for the critic optimizer using rlOptimizerOptions (Reinforcement Learning Toolbox).
criticOptions = rlOptimizerOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFactor',1e-4);
Create the critic function using the specified deep neural network. You must also specify the action and observation info for the critic, which you obtain from the environment interface. For more information, see
rlQValueFunction (Reinforcement Learning Toolbox).
critic = rlQValueFunction(criticNetwork,observationInfo,actionInfo,...
    'ObservationInputNames','observation','ActionInputNames','action');
A DDPG agent decides which action to take given observations by using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action.
Construct the actor similarly to the critic. For more information, see
rlContinuousDeterministicActor (Reinforcement Learning Toolbox).
actorNetwork = [
    featureInputLayer(9,'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(2,'Name','fc4')
    tanhLayer('Name','tanh1')
    scalingLayer('Name','ActorScaling1','Scale',[2.5;0.2618],'Bias',[-0.5;0])];
actorNetwork = dlnetwork(actorNetwork);
actorOptions = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor',1e-4);
actor = rlContinuousDeterministicActor(actorNetwork,observationInfo,actionInfo);
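The final scalingLayer maps the tanh output range [–1,1] onto the action limits; scalingLayer computes Scale.*x + Bias elementwise. You can verify the chosen Scale and Bias values with a quick arithmetic check:

```matlab
% Verify that the scaling layer maps tanh outputs in [-1,1] onto the
% action limits [-3,2] (m/s^2) and [-0.2618,0.2618] (rad).
scale = [2.5;0.2618];
bias  = [-0.5;0];
lowerAction = scale.*(-1) + bias  % [-3; -0.2618]
upperAction = scale.*( 1) + bias  % [ 2;  0.2618]
```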
To create the DDPG agent, first specify the DDPG agent options using
rlDDPGAgentOptions (Reinforcement Learning Toolbox).
agentOptions = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'ActorOptimizerOptions',actorOptions,...
    'CriticOptimizerOptions',criticOptions,...
    'ExperienceBufferLength',1e6);
agentOptions.NoiseOptions.Variance = [0.6;0.1];
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;
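As a rough check on the exploration schedule, assuming the noise variance decays geometrically per sample as Variance*(1 − VarianceDecayRate), you can estimate how many steps it takes the variance to halve:

```matlab
% Estimate the half-life (in steps) of the exploration noise variance,
% assuming a geometric decay of Variance*(1 - VarianceDecayRate) per sample.
decayRate = 1e-5;
halfLifeSteps = log(0.5)/log(1 - decayRate);  % roughly 6.9e4 steps
```

With Ts = 0.1 s and 600 steps per episode, this corresponds to noticeable exploration over roughly the first hundred episodes.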
Then, create the DDPG agent using the specified actor representation, critic representation, and agent options. For more information, see
rlDDPGAgent (Reinforcement Learning Toolbox).
agent = rlDDPGAgent(actor,critic,agentOptions);
To train the agent, first specify the training options. For this example, use the following options:

Run each training for at most 10000 episodes, with each episode lasting at most maxsteps time steps.

Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).

Stop training when the agent receives a cumulative episode reward greater than 1700.
For more information, see
rlTrainingOptions (Reinforcement Learning Toolbox).
maxepisodes = 1e4;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue',1700);
Train the agent using the train (Reinforcement Learning Toolbox) function. Training is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainingOpts);
else
    % Load a pretrained agent for the example.
    load('SimulinkPFCDDPG.mat','agent')
end
Simulate DDPG Agent
To validate the performance of the trained agent, simulate the agent within the Simulink environment by uncommenting the following commands. For more information on agent simulation, see
rlSimulationOptions (Reinforcement Learning Toolbox) and
sim (Reinforcement Learning Toolbox).
% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);
To demonstrate the trained agent using deterministic initial conditions, simulate the model in Simulink.
e1_initial = -0.4;
e2_initial = 0.1;
x0_lead = 80;
sim(mdl)
The following plots show the simulation results when the lead car is 70 (m) ahead of the ego car.
In the first 35 seconds, the relative distance is greater than the safe distance (bottom-right plot), so the ego car tracks the set velocity (top-right plot). To speed up and reach the set velocity, the acceleration is mostly nonnegative (top-left plot).
From 35 to 42 seconds, the relative distance is mostly less than the safe distance (bottom-right plot), so the ego car tracks the minimum of the lead velocity and set velocity. Because the lead velocity is less than the set velocity (top-right plot), to track the lead velocity the acceleration becomes mostly nonpositive (top-left plot).
From 42 to 58 seconds, the ego car tracks the set velocity (top-right plot) and the acceleration remains zero (top-left plot).
From 58 to 60 seconds, the relative distance becomes less than the safe distance (bottom-right plot), so the ego car slows down and tracks the lead velocity.
The bottom-left plot shows the lateral deviation. As shown in the plot, the lateral deviation is greatly decreased within 1 second. The lateral deviation remains less than 0.05 m.
Close the Simulink model.
bdclose(mdl)
function in = localResetFcn(in)
in = setVariable(in,'x0_lead',40+randi(60,1,1));    % random value for initial position of lead car
in = setVariable(in,'e1_initial',0.5*(-1+2*rand));  % random value for lateral deviation
in = setVariable(in,'e2_initial',0.1*(-1+2*rand));  % random value for relative yaw angle
end