How do I confirm whether training of a DDPG reinforcement learning agent is complete?

Hello MathWorks Community,
I have recently been working on training a DDPG RL agent for electricity network control.
After several rounds of tweaking and training, I got a training curve like this:
I read several questions & answers about the episode Q0 for an agent with an actor and a critic. If I understand correctly, when the agent is well trained, the episode Q0 should converge to the average reward curve. However, this is not always the case, and the critic network may take longer to train.
As shown in my case, it seems that both the average reward and the episode Q0 have converged, but to different values. Does this mean there is something wrong with the critic network or the reward function that prevents the critic from estimating the final episode reward correctly?
Btw, this agent was trained for 10,000 episodes, and each episode has 168 steps.
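One thing worth checking (my assumption, not confirmed by this thread): episode Q0 is the critic's estimate of the *discounted* return from the initial observation, while the "Average Reward" curve typically plots the *undiscounted* sum of rewards per episode. Over a long episode (168 steps here) with a discount factor below 1, those two quantities settle at different values even if the critic is perfect. A minimal sketch of the gap, using a hypothetical flat reward of 1.0 per step and a discount factor of 0.99:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t -- the quantity Q0 tries to estimate."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Hypothetical constant reward of 1.0 per step over a 168-step episode.
rewards = [1.0] * 168

undiscounted = sum(rewards)                           # what the reward plot shows: 168.0
discounted = discounted_return(rewards, gamma=0.99)   # what Q0 targets: ~81.5

print(undiscounted, round(discounted, 1))
```

So a stable gap between the two curves can simply reflect discounting, not a broken critic; comparing Q0 against the realized *discounted* return for a few episodes is a more direct sanity check.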

Answers (1)

It seems that in DDPG the average reward doesn't need to match Q0. You can judge by the average reward alone. But I don't know exactly why.


Release: R2021b
Asked: 29 Oct 2021
Answered: 3 May 2023
