RL DDPG agent not converging
Hi,
I am training a DDPG agent to control a single cart that starts with an initial speed and moves along a horizontal axis. The RL agent acts as a controller that applies a force along the axis to drive the cart to the origin. It should not be a difficult task; however, after training for many steps, the control performance is still far from optimal.
These are my configurations for the agent and the environment. The optimal policy should drive the applied force to zero, meaning that the cart should no longer be moving once it reaches the origin.
The agent is built with an actor-critic architecture:
function [agents] = createDDPGAgents(N)
% Function to create two DDPG agents with the same observation and action info.
obsInfo = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1));
actInfo = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
% Define observation and action paths for critic
obsPath = featureInputLayer(prod(obsInfo.Dimension), Name="obsInLyr");
actPath = featureInputLayer(prod(actInfo.Dimension), Name="actInLyr");
% Define common path: concatenate along first dimension
commonPath = [
concatenationLayer(1, 2, Name="concat")
fullyConnectedLayer(30)
reluLayer
fullyConnectedLayer(1)
];
% Add paths to layerGraph network
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
% Connect paths
criticNet = connectLayers(criticNet, "obsInLyr", "concat/in1");
criticNet = connectLayers(criticNet, "actInLyr", "concat/in2");
% Plot the network
plot(criticNet)
% Convert to dlnetwork object
criticNet = dlnetwork(criticNet);
% Display the number of weights
summary(criticNet)
% Create the critic approximator object
critic = rlQValueFunction(criticNet, obsInfo, actInfo, ...
ObservationInputNames="obsInLyr", ...
ActionInputNames="actInLyr");
% Check the critic with random observation and action inputs
getValue(critic, {rand(obsInfo.Dimension)}, {rand(actInfo.Dimension)})
% Create a network to be used as underlying actor approximator
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(prod(actInfo.Dimension))
];
% Convert to dlnetwork object
actorNet = dlnetwork(actorNet);
% Display the number of weights
summary(actorNet)
% Create the actor
actor = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
%% DDPG Agent Options
agentOptions = rlDDPGAgentOptions(...
'DiscountFactor', 0.98, ...
'MiniBatchSize', 128, ...
'TargetSmoothFactor', 1e-3, ...
'ExperienceBufferLength', 1e6, ...
'SampleTime', -1);
%% Create Two DDPG Agents
agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);
% Return agents as an array
agents = [agent1, agent2];
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4;
agentOptions.NoiseOptions
end
The environment:
function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals,S)
% Environment parameters
nextObs1 = S.A1d*loggedSignals.State + S.B1d*action(1);
nextObs = nextObs1;
loggedSignals.State = nextObs1;
if abs(loggedSignals.State(1))<=0.05 && abs(loggedSignals.State(2))<=0.05
    reward1 = 10;
else
    reward1 = -1*(1.01*(nextObs1(1))^2 + 1.01*nextObs1(2)^2 + action^2 );
    if reward1 <= -1000
        reward1 = -1000;
    end
end
reward = reward1;
if abs(loggedSignals.State(1))<=0.02 && abs(loggedSignals.State(2))<=0.02
    isDone = true;
else
    isDone = false;
end
end
And this is the simulation setup (I omitted the reset function here; S.N = 1):
obsInfo1 = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1)) ;
actInfo1 = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
stepFn1 = @(action, loggedSignals) myStepFunction(action, loggedSignals, S);
resetFn1 = @() myResetFunction(pos1);
env = rlFunctionEnv(obsInfo1, actInfo1, stepFn1, resetFn1);
%% Specify agent initialization
agent= createDDPGAgents(S.N);
loggedSignals = [];
trainOpts = rlTrainingOptions(...
StopOnError="on",...
MaxEpisodes=1000,... %1100 for fully trained
MaxStepsPerEpisode=1000,...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
Plots="training-progress");
%"training-progress"
train(agent, env, trainOpts);
This is the reward plot, where each episode takes a very long time, but there are still no signs of reaching the positive reward for this simple system.

And this is the control effect on both states, which shows that the RL agent is driving the cart to the wrong position, near -1, while its velocity is 0.

It is very weird that the reward converges not to the positive value but to another point. Can I ask where the problem could be? Thanks.
Haochen
Answers (1)
Prathamesh
on 3 Jun 2025
I understand that you are training a DDPG agent to control a single cart with an initial speed moving along a horizontal axis. The plots show the agent is not reaching the origin and is getting stuck with negative rewards. This is common when the agent is not getting clear enough feedback or is not exploring enough.
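On the exploration side, one thing worth checking in “createDDPGAgents” is the order of the noise settings: the NoiseOptions are assigned after rlDDPGAgent has already been called, so the exploration settings may never reach the two agents (assuming, as is usual for these option objects, that the options are copied into the agent at construction). A minimal sketch of the intended order, reusing the values from your own code:
agentOptions = rlDDPGAgentOptions(...
    'DiscountFactor', 0.98, ...
    'MiniBatchSize', 128, ...
    'TargetSmoothFactor', 1e-3, ...
    'ExperienceBufferLength', 1e6, ...
    'SampleTime', -1);
% Configure the Ornstein-Uhlenbeck exploration noise BEFORE creating the agents
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;           % larger -> more exploration
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4; % smaller -> explore for longer
% Now the agents are constructed with the intended exploration settings
agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);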
Your agent likely gets a large reward only when it is already very close to the origin. For every other step, it just gets a penalty.
Modify the “myStepFunction” so the reward gives the agent continuous feedback at every step (see the sketch after this list):
- Make the reward negative (a penalty) whenever the cart is away from the origin (position not zero) or has speed (velocity not zero).
- Also add a small penalty for the amount of force the agent applies. This encourages the agent to use only the necessary force.
- You'll need to decide how much each penalty matters. For example, penalize being far from the origin more heavily than using a little bit of force.
- The agent will then always try to make its reward less negative, which pushes it towards the origin using minimal force.
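As a minimal sketch (not a tuned controller), the reward part of “myStepFunction” could then look something like this; it keeps the state convention from your code (state(1) = position, state(2) = velocity), and the weights are only illustrative starting points:
% Shaped reward inside myStepFunction (illustrative weights, to be tuned)
nextObs = S.A1d*loggedSignals.State + S.B1d*action(1);
loggedSignals.State = nextObs;
% Continuous penalty: position error dominates, velocity comes next, and the
% control effort gets only a small weight so the agent is still free to act.
posPenalty   = 1.0  * nextObs(1)^2;
velPenalty   = 0.5  * nextObs(2)^2;
forcePenalty = 0.01 * action(1)^2;
reward = -(posPenalty + velPenalty + forcePenalty);
% Optional small bonus near the goal instead of a single large spike,
% so the reward changes smoothly as the cart settles.
if abs(nextObs(1)) <= 0.05 && abs(nextObs(2)) <= 0.05
    reward = reward + 1;
end
% End the episode only once the cart has settled at the origin.
isDone = abs(nextObs(1)) <= 0.02 && abs(nextObs(2)) <= 0.02;
With a shape like this, the penalty shrinks smoothly to zero as the cart approaches the origin with zero velocity and zero force, so every step tells the agent whether it is improving instead of the feedback appearing only inside the ±0.05 band.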