How to bound DQN critic estimate or RL training progress y-axis

I'm training a DQN agent with the new Reinforcement Learning Toolbox. During training, the critic network generates a long-term reward estimate (Q0) at the start of each episode; these estimates are displayed in green on the training-progress plot, with the episode reward in blue and the running average reward in red. As the screenshot shows, the actual rewards average around -1000, but the first few estimates were orders of magnitude greater, so they permanently skew the y-axis and make it impossible to discern the progress of the actual rewards during training.
(Attached screenshot: large_Q0_ex.PNG)
It seems I either need to bound the critic's estimate, or set limits on the Reinforcement Learning Episode Manager's y-axis. I haven't found a way to do either.

 Accepted Answer

Hello,
I believe the best approach here is to figure out why the critic estimate takes large values. Even if you scale the plot window, if the critic estimates are off, it will have an impact on training. Also, bounding the estimate values would not be ideal either, because you are losing information (one action may be better than another, but this won't be reflected in the estimate). A few things to try:
1) Make sure that the gradient threshold option in the representation options of the network is finite, e.g. set it to 1. This will prevent the weights from changing too much during training.
2) Try reducing the number of layers/nodes
3) Try providing initial (small) values for the network weights (especially the last FC layer)
4) Maybe adding a scaling layer towards the end would be helpful as well
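A minimal sketch of suggestions 1 and 4 combined (the layer names and the Scale/Bias values are illustrative, not from the original question):

```matlab
% Finite gradient threshold (suggestion 1): clip gradients so the critic's
% weights cannot jump by large amounts in a single update.
criticOpts = rlRepresentationOptions('LearnRate',1e-3,'GradientThreshold',1);

% Scaling layer near the output (suggestion 4): squeeze the raw estimate
% into a range close to the rewards actually observed (around -1000 here).
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticFC')
    scalingLayer('Name','CriticOutput','Scale',100,'Bias',-1000)];
```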

6 Comments

I agree that I should focus on getting better initial estimates.
  1. I tried changing my gradient threshold from the default (infinity) to 1, but that didn't seem to help.
  2. My current architecture already seems fairly small: 3 layers on the state path, 1 layer on the action path, and 2 layers on the common output path (24 nodes in each layer except for a single node in the last). Might this still be too big?
  3. There are many options for weight initializers, so I'm not sure which to choose for each layer. Do you have any tips for this or can you point me to further reading?
  4. I added a scaling layer just before the output and set the scale & bias according to what I was seeing in the original estimates. This had a great impact and brought the estimates much closer to reality! I would still like to avoid this reactive approach, however, so I think smarter weight initialization is preferable.
For weight initialization see the answer here. You can also directly set weight values using the 'Weights' option of the FC layer - see for instance the 'createNetworks' script in the walking robot example.
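As a sketch of direct weight initialization for the last FC layer (the sizes and values here are illustrative; 24 is assumed to be the width of the preceding layer):

```matlab
% Small explicit initial weights keep the first Q estimates near zero,
% so the critic starts in the right order of magnitude.
lastFC = fullyConnectedLayer(1,'Name','CriticOutput', ...
    'Weights',1e-3*randn(1,24), ...   % small random initial weights
    'Bias',0);                        % initial estimate starts at zero

% Alternatively, pick one of the built-in initializers:
% fullyConnectedLayer(1,'Name','CriticOutput','WeightsInitializer','he')
```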
The scaling layer is actually a better alternative than bounding, because it is a linear process which is taken into consideration during training.
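One heuristic for choosing the scaling layer's parameters (an assumption on my part, not from this thread): take the range of episode rewards you actually observe and map a roughly unit-scale pre-output onto it.

```matlab
% Heuristic sketch: map a pre-output in about [-1, 1] onto the observed
% reward range, here assumed to be roughly [-2000, 0].
rMin = -2000; rMax = 0;               % observed reward range (illustrative)
scale = (rMax - rMin)/2;              % half the span -> 1000
bias  = (rMax + rMin)/2;              % midpoint      -> -1000
outLayer = scalingLayer('Name','CriticOutput','Scale',scale,'Bias',bias);
```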
How could the parameters of the scaling layer (scale & bias) be chosen for the critic output?
Can you please post a separate question, as this one is from 3 years ago? Thank you.
Sir, during training I get some rewards as high as 10^16 (see the attached screenshot). Can you please help me figure out what I am doing wrong?
This is the code I am using:
Tf = 10;
Ts = 0.1;
mdl = 'rl_exam2';
obsInfo = rlNumericSpec([3 1]);
obsInfo.Name = 'observations';
obsInfo.Description = 'integrated error, error, Response';
numObservations = obsInfo.Dimension(1);
actInfo = rlNumericSpec([1 1],'LowerLimit',0,'UpperLimit',1);
actInfo.Name = 'Control Input';
numActions = actInfo.Dimension(1);
%% To Create Environment
env = rlSimulinkEnv(mdl,[mdl '/RL Agent'],obsInfo,actInfo);
%%
rng(0)
%%
%% To Create Critic Network
statePath = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(50,'Name','CriticStateFC1')
    reluLayer('Name','CriticRelu1')
    fullyConnectedLayer(40,'Name','CriticStateFC2')];
actionPath = [
    imageInputLayer([numActions 1 1],'Normalization','none','Name','Action')
    fullyConnectedLayer(40,'Name','CriticActionFC1')];
commonPath = [
    additionLayer(2,'Name','add')
    reluLayer('Name','CriticCommonRelu')
    fullyConnectedLayer(1,'Name','CriticOutput')];
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,statePath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);
criticNetwork = connectLayers(criticNetwork,'CriticStateFC2','add/in1');
criticNetwork = connectLayers(criticNetwork,'CriticActionFC1','add/in2');
criticOpts = rlRepresentationOptions('LearnRate',1e-03,'GradientThreshold',1);
critic = rlQValueRepresentation(criticNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},criticOpts);
actorNetwork = [
    imageInputLayer([numObservations 1 1],'Normalization','none','Name','State')
    fullyConnectedLayer(40,'Name','actorFC1')
    reluLayer('Name','ActorRelu1')
    fullyConnectedLayer(numActions,'Name','actorFC2')
    tanhLayer('Name','actorTanh')
    scalingLayer('Name','Action','Scale',0.5,'Bias',0.5)];
actorOptions = rlRepresentationOptions('LearnRate',1e-04,'GradientThreshold',1);
actor = rlDeterministicActorRepresentation(actorNetwork,obsInfo,actInfo,'Observation',{'State'},'Action',{'Action'},actorOptions);
%% To Create Agent
agentOpts = rlDDPGAgentOptions(...
    'SampleTime',0.1,...
    'TargetSmoothFactor',1e-3,...
    'DiscountFactor',1,...
    'ExperienceBufferLength',1e6,...
    'MiniBatchSize',64);
agentOpts.NoiseOptions.Variance = 0.08;
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;
agent = rlDDPGAgent(actor,critic,agentOpts);
%% Training Options
maxepisodes = 3000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
'MaxEpisodes',maxepisodes,...
'MaxStepsPerEpisode',maxsteps,...
'ScoreAveragingWindowLength',20, ...
'Verbose',false,...
'Plots','training-progress',...
'StopTrainingCriteria','EpisodeCount',...
'StopTrainingValue',1500);
%% TO TRAIN
doTraining = true;
if doTraining
trainingStats = train(agent,env,trainingOpts);
% save('agent_new.mat','agent') % to save the trained agent
else
% Load pretrained agent for the example.
load('agent_old.mat','agent')
end
Sir, can I please get any mail ID, or you can mail me at sourabhy711@gmai.com. I would be very grateful.
The magnitude of your reward is simply a consequence of how you designed your reward function. Such a large negative reward shows how badly your agent is doing in some episodes (probably while exploring). You should stop those episodes with a reasonable isdone condition and a reasonable negative reward for terminating the episode.
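The early-termination idea above can be sketched as a small guard used when computing the environment's step output (the threshold and penalty values are illustrative assumptions, not from this thread):

```matlab
function [reward,isDone] = guardEpisode(err,reward)
% Terminate the episode when the error signal diverges and apply a bounded
% penalty, instead of letting a huge negative reward accumulate.
    errLimit = 1e3;                   % illustrative divergence threshold
    isDone   = abs(err) > errLimit;
    if isDone
        reward = reward - 1e3;        % bounded terminal penalty
    end
end
```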
