RL DDPG agent not converging
Hi,
I am training a DDPG agent to control a single cart that starts with an initial speed and moves along a horizontal axis. The RL agent acts as a controller that applies a force along the axis to drive the cart to the origin. It should not be a difficult task; however, after training for many steps, the control performance is still far from optimal.
These are my configurations for the agent and the environment. The optimal policy should drive the applied force to zero, meaning that the cart should no longer be moving once it reaches the origin.
The agent is built with an actor-critic architecture:
function [agents] = createDDPGAgents(N)
% Function to create two DDPG agents with the same observation and action info.
obsInfo = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1));
actInfo = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
% Define observation and action paths for critic
obsPath = featureInputLayer(prod(obsInfo.Dimension), Name="obsInLyr");
actPath = featureInputLayer(prod(actInfo.Dimension), Name="actInLyr");
% Define common path: concatenate along first dimension
commonPath = [
concatenationLayer(1, 2, Name="concat")
fullyConnectedLayer(30)
reluLayer
fullyConnectedLayer(1)
];
% Add paths to layerGraph network
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
% Connect paths
criticNet = connectLayers(criticNet, "obsInLyr", "concat/in1");
criticNet = connectLayers(criticNet, "actInLyr", "concat/in2");
% Plot the network
plot(criticNet)
% Convert to dlnetwork object
criticNet = dlnetwork(criticNet);
% Display the number of weights
summary(criticNet)
% Create the critic approximator object
critic = rlQValueFunction(criticNet, obsInfo, actInfo, ...
ObservationInputNames="obsInLyr", ...
ActionInputNames="actInLyr");
% Check the critic with random observation and action inputs
getValue(critic, {rand(obsInfo.Dimension)}, {rand(actInfo.Dimension)})
% Create a network to be used as underlying actor approximator
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(prod(actInfo.Dimension))
];
% Convert to dlnetwork object
actorNet = dlnetwork(actorNet);
% Display the number of weights
summary(actorNet)
% Create the actor
actor = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
%% DDPG Agent Options
agentOptions = rlDDPGAgentOptions(...
'DiscountFactor', 0.98, ...
'MiniBatchSize', 128, ...
'TargetSmoothFactor', 1e-3, ...
'ExperienceBufferLength', 1e6, ...
'SampleTime', -1);
%% Create Two DDPG Agents
agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);
% Return agents as an array
agents = [agent1, agent2];
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4;
agentOptions.NoiseOptions
end
The environment:
function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals,S)
% Environment parameters
nextObs1 = S.A1d*loggedSignals.State + S.B1d*action(1);
nextObs = nextObs1;
loggedSignals.State = nextObs1;
if abs(loggedSignals.State(1))<=0.05 && abs(loggedSignals.State(2))<=0.05
    reward1 = 10;
else
    reward1 = -1*(1.01*(nextObs1(1))^2 + 1.01*nextObs1(2)^2 + action^2 );
    if reward1 <= -1000
        reward1 = -1000;
    end
end
reward = reward1;
if abs(loggedSignals.State(1))<=0.02 && abs(loggedSignals.State(2))<=0.02
    isDone = true;
else
    isDone = false;
end
end
And this is the simulation setup (I omitted the reset function here; S.N = 1):
obsInfo1 = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1)) ;
actInfo1 = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
stepFn1 = @(action, loggedSignals) myStepFunction(action, loggedSignals, S);
resetFn1 = @() myResetFunction(pos1);
env = rlFunctionEnv(obsInfo1, actInfo1, stepFn1, resetFn1);
%% Specify agent initialization
agent= createDDPGAgents(S.N);
loggedSignals = [];
trainOpts = rlTrainingOptions(...
StopOnError="on",...
MaxEpisodes=1000,... %1100 for fully trained
MaxStepsPerEpisode=1000,...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
Plots="training-progress");
%"training-progress"
train(agent, env, trainOpts);
This is the reward plot, where each episode takes a very long time, but there are still no signs of reaching the positive reward for this simple system.

And this is the control effect on both states, which shows that the RL agent is driving the cart to the wrong position, near -1, while its velocity is 0.

It is very weird that the reward converges not to the positive value but to another point. Can I ask where the problem could be? Thanks.
Haochen
Answers (1)
Prathamesh
on 3 Jun 2025
I understand that you are training a DDPG agent to control a single cart with an initial speed moving along a horizontal axis. The plots show the agent is not reaching the origin and is getting stuck with negative rewards. This is common when the agent is not getting clear enough feedback or is not exploring enough.
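On the exploration side, one thing worth checking in “createDDPGAgents” is the order of the noise settings: the NoiseOptions are assigned after rlDDPGAgent has already been called, so the exploration settings may never reach the two agents (assuming, as is usual for these option objects, that the options are copied into the agent at construction). A minimal sketch of the intended order, reusing the values from your own code:
agentOptions = rlDDPGAgentOptions(...
    'DiscountFactor', 0.98, ...
    'MiniBatchSize', 128, ...
    'TargetSmoothFactor', 1e-3, ...
    'ExperienceBufferLength', 1e6, ...
    'SampleTime', -1);
% Configure the Ornstein-Uhlenbeck exploration noise BEFORE creating the agents
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;           % larger -> more exploration
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4; % smaller -> explore for longer
% Now the agents are constructed with the intended exploration settings
agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);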
Your agent likely gets a large reward only when it is already very close to the origin. For every other step, it just gets a penalty.
Modify the “myStepFunction” so the reward gives the agent continuous feedback at every step (see the sketch after this list):
- Make the reward negative (a penalty) whenever the cart is away from the origin (position not zero) or has speed (velocity not zero).
- Also add a small penalty for the amount of force the agent applies. This encourages the agent to use only the necessary force.
- You'll need to decide how much each penalty matters. For example, penalize being far from the origin more heavily than using a little bit of force.
- The agent will then always try to make its reward less negative, which pushes it towards the origin using minimal force.
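As a minimal sketch (not a tuned controller), the reward part of “myStepFunction” could then look something like this; it keeps the state convention from your code (state(1) = position, state(2) = velocity), and the weights are only illustrative starting points:
% Shaped reward inside myStepFunction (illustrative weights, to be tuned)
nextObs = S.A1d*loggedSignals.State + S.B1d*action(1);
loggedSignals.State = nextObs;
% Continuous penalty: position error dominates, velocity comes next, and the
% control effort gets only a small weight so the agent is still free to act.
posPenalty   = 1.0  * nextObs(1)^2;
velPenalty   = 0.5  * nextObs(2)^2;
forcePenalty = 0.01 * action(1)^2;
reward = -(posPenalty + velPenalty + forcePenalty);
% Optional small bonus near the goal instead of a single large spike,
% so the reward changes smoothly as the cart settles.
if abs(nextObs(1)) <= 0.05 && abs(nextObs(2)) <= 0.05
    reward = reward + 1;
end
% End the episode only once the cart has settled at the origin.
isDone = abs(nextObs(1)) <= 0.02 && abs(nextObs(2)) <= 0.02;
With a shape like this, the penalty shrinks smoothly to zero as the cart approaches the origin with zero velocity and zero force, so every step tells the agent whether it is improving instead of the feedback appearing only inside the ±0.05 band.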