Train DDPG Agent for Path Following Control

This example shows how to train a deep deterministic policy gradient (DDPG) agent for path-following control (PFC) in Simulink®. For more information on DDPG agents, see Deep Deterministic Policy Gradient Agents (Reinforcement Learning Toolbox).

Simulink Model

The reinforcement learning environment for this example is a simple bicycle model for ego car and a simple longitudinal model for lead car. The training goal is to make the ego car travel at a set velocity while maintaining a safe distance from lead car by controlling longitudinal acceleration and braking, while also keeping the ego car travelling along the centerline of its lane by controlling the front steering angle. For more information on PFC, see Path Following Control System. The ego car dynamics are specified by the following parameters.

m = 1600;   % total vehicle mass (kg)
Iz = 2875;  % yaw moment of inertia (mNs^2)
lf = 1.4;   % longitudinal distance from center of gravity to front tires (m)
lr = 1.6;   % longitudinal distance from center of gravity to rear tires (m)
Cf = 19000; % cornering stiffness of front tires (N/rad)
Cr = 33000; % cornering stiffness of rear tires (N/rad)
tau = 0.5;  % longitudinal time constant

Specify the initial position and velocity for the two vehicles.

x0_lead = 50;   % initial position for lead car (m)
v0_lead = 24;   % initial velocity for lead car (m/s)
x0_ego = 10;    % initial position for ego car (m)
v0_ego = 18;    % initial velocity for ego car (m/s)

Specify standstill default spacing (m), time gap (s) and driver-set velocity (m/s).

D_default = 10;
t_gap = 1.4;
v_set = 28;

Considering the physical limitations of the vehicle dynamics, the acceleration is constrained to the range [-3,2] (m/s^2), and steering angle is constrained to be [-0.5,0.5] (rad).

amin_ego = -3;
amax_ego = 2;
umin_ego = -0.5;
umax_ego = 0.5;

The curvature of the road is defined by a constant 0.001(m-1). The initial value for lateral deviation is 0.2 m and the initial value for relative yaw angle is -0.1 rad.

rho = 0.001;
e1_initial = 0.2;
e2_initial = -0.1;

Define the sample time, Ts, and simulation duration, Tf, in seconds.

Ts = 0.1;
Tf = 60;

Open the model.

mdl = 'rlPFCMdl';
open_system(mdl)
agentblk = [mdl '/RL Agent'];

For this model:

  • The action signal consists of acceleration and steering angle actions. The acceleration action signal takes value between -3 and 2 (m/s^2). The steering action signal takes value between -15 degrees (-0.2618 rad) to 15 degrees (0.2618 rad).

  • The reference velocity for ego car Vref is defined as follows. If relative distance is less than safe distance, ego car tracks the minimum of lead car velocity and driver-set velocity. In this manner, ego car maintains some distance from lead car. If relative distance is greater than safe distance, ego car tracks driver-set velocity. In this example, safe distance is defined as a linear function of ego car longitudinal velocity V, that is, tgap*V+Ddefault. The safe distance determines the tracking velocity for the ego car.

  • The observations from the environment contain the longitudinal measurements: the velocity error eV=Vref-Vego, its integral e and ego car longitudinal velocity V. In addition, the observations contain the lateral measurements: the lateral deviation e1, relative yaw angle e2, their derivatives e˙1 and e˙2, and their integrals e1 and e2.

  • The simulation is terminated when lateral deviation |e1|>1 or longitudinal velocity Vego<0.5 or relative distance between lead car and ego car Drel<0.

  • The reward rt, provided at every time step t, is:

rt=-(100e12+500ut-12+10eV2+100at-12)×1e-3-10Ft+2Ht+Mt

where ut-1 is the steering input from the previous time step t-1, at-1 is the acceleration input from the previous time step. The three logical values are: Ft=1if simulation is terminated, otherwise Ft=0; Ht=1 if lateral error e12<0.01, otherwise Ht=0; Mt=1 if velocity error eV2<1, otherwise Mt=0. The three logical terms in the reward encourage the agent to make both lateral error and velocity error small, in the meantime, penalize the agent if the simulation is terminated early.

Create Environment Interface

Create an environment interface for the Simulink model.

% create the observation info
observationInfo = rlNumericSpec([9 1],'LowerLimit',-inf*ones(9,1),'UpperLimit',inf*ones(9,1));
observationInfo.Name = 'observations';
% action Info
actionInfo = rlNumericSpec([2 1],'LowerLimit',[-3;-0.2618],'UpperLimit',[2;0.2618]);
actionInfo.Name = 'accel;steer';
% define environment
env = rlSimulinkEnv(mdl,agentblk,observationInfo,actionInfo);

To define the initial conditions, specify an environment reset function using an anonymous function handle.

% randomize initial positions of lead car, lateral deviation and relative
% yaw angle
env.ResetFcn = @(in)localResetFcn(in);

Fix the random generator seed for reproducibility.

rng(0)

Create DDPG agent

A DDPG agent approximates the long-term reward given observations and actions using a critic value function representation. To create the critic, first create a deep neural network with two inputs, the state and action, and one output. For more information on creating a deep neural network value function representation, see Create Policy and Value Function Representations (Reinforcement Learning Toolbox).

L = 100; % number of neurons
statePath = [
    imageInputLayer([9 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    additionLayer(2,'Name','add')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(1,'Name','fc4')];

actionPath = [
    imageInputLayer([2 1 1],'Normalization','none','Name','action')
    fullyConnectedLayer(L,'Name','fc5')];

criticNetwork = layerGraph(statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
    
criticNetwork = connectLayers(criticNetwork,'fc5','add/in2');

View the critic network configuration.

figure
plot(criticNetwork)

Specify options for the critic representation using rlRepresentationOptions.

criticOptions = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1,'L2RegularizationFactor',1e-4);

Create the critic representation using the specified deep neural network and options. You must also specify the action and observation info for the critic, which you obtain from the environment interface. For more information, see rlRepresentation.

critic = rlRepresentation(criticNetwork,observationInfo,actionInfo,...
    'Observation',{'observation'},'Action',{'action'},criticOptions);

A DDPG agent decides which action to take given observations using an actor representation. To create the actor, first create a deep neural network with one input, the observation, and one output, the action.

Construct the actor similarly to the critic.

actorNetwork = [
    imageInputLayer([9 1 1],'Normalization','none','Name','observation')
    fullyConnectedLayer(L,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(L,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(L,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(2,'Name','fc4')
    tanhLayer('Name','tanh1')
    scalingLayer('Name','ActorScaling1','Scale',reshape([2.5;0.2618],[1,1,2]),'Bias',reshape([-0.5;0],[1,1,2]))];
actorOptions = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',1,'L2RegularizationFactor',1e-4);
actor = rlRepresentation(actorNetwork,observationInfo,actionInfo,...
    'Observation',{'observation'},'Action',{'ActorScaling1'},actorOptions);

To create the DDPG agent, first specify the DDPG agent options using rlDDPGAgentOptions.

agentOptions = rlDDPGAgentOptions(...
    'SampleTime',Ts,...
    'TargetSmoothFactor',1e-3,...
    'ExperienceBufferLength',1e6,...
    'DiscountFactor',0.99,...
    'MiniBatchSize',64);
agentOptions.NoiseOptions.Variance = [0.6;0.1];
agentOptions.NoiseOptions.VarianceDecayRate = 1e-5;

Then, create the DDPG agent using the specified actor representation, critic representation and agent options. For more information, see rlDDPGAgent.

agent = rlDDPGAgent(actor,critic,agentOptions);

Train Agent

To train the agent, first specify the training options. For this example, use the following options:

  • Run each training episode for at most 10000 episodes, with each episode lasting at most 600 time steps.

  • Display the training progress in the Episode Manager dialog box.

  • Stop training when the agent receives an average cumulative reward greater than 1700.

For more information, see rlTrainingOptions.

maxepisodes = 1e4;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    'MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue',1700);

Train the agent using the train function. This is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;

if doTraining    
    % Train the agent.
    trainingStats = train(agent,env,trainingOpts);
else
    % Load pretrained agent for the example.
    load('SimulinkPFCDDPG.mat','agent')       
end

Simulate DDPG Agent

To validate the performance of the trained agent, uncomment the following two lines and simulate it within the environment. For more information on agent simulation, see rlSimulationOptions and sim.

% simOptions = rlSimulationOptions('MaxSteps',maxsteps);
% experience = sim(env,agent,simOptions);

To demonstrate the trained agent using deterministic initial conditions, simulate the model in Simulink.

e1_initial = -0.4;
e2_initial = 0.1;
x0_lead = 80;
sim(mdl)

The following plots show the simulation results when lead car is 70 (m) ahead of ego car.

  • In the first 35 seconds, relative distance is greater than safe distance (bottom right plot), thus ego car tracks set velocity (top right plot). To speed up and reach the set velocity, acceleration is mostly non-negative (top left plot).

  • From 35 to 42 seconds, relative distance is mostly less than safe distance (bottom right plot), thus ego car tracks the minimum of lead velocity and set velocity. Since lead velocity is less than set velocity (top right plot), to track lead velocity, acceleration becomes non-zero (top left plot).

  • From 42 to 58 seconds, ego car tracks set velocity (top right plot) and acceleration remains zero (top left plot).

  • From 58 to 60 seconds, relative distance becomes less than safe distance (bottom right plot), thus ego car slows down and tracks lead velocity.

  • The bottom left plot shows the lateral deviation. As shown in the plot, the lateral deviation is greatly decreased within one second. The lateral deviation remains less than 0.05 m.

Close Simulink model.

bdclose(mdl)

Local Function

function in = localResetFcn(in)
% reset 
in = setVariable(in,'x0_lead',40+randi(60,1,1));    % random value for initial position of lead car
in = setVariable(in,'e1_initial', 0.5*(-1+2*rand)); % random value for lateral deviation
in = setVariable(in,'e2_initial', 0.1*(-1+2*rand)); % random value for relative yaw angle
end

See Also

Related Topics