
freezing layers of actor and critic of RL agent

After training, I froze every layer of the actor and critic networks of my RL agent (using setLearnRateFactor(neuralnet,'layers','parameters',0);) and then retrained the agent in the same environment. The rewards I get are shown in the attached image.
My question is: is it normal to get rewards like this? (Shouldn't there be no variation, or only very little variation, in the rewards?)
My reward function is 10 - e^2, where e is the error.
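For context, freezing every learnable parameter of both networks can be done roughly as follows (a minimal sketch, assuming a dlnetwork-based actor and critic accessed through the Reinforcement Learning Toolbox functions getActor/getCritic and getModel/setModel; the helper freezeNet is illustrative):
actor  = getActor(agent);                  % actor function approximator
critic = getCritic(agent);                 % critic function approximator
actorNet  = freezeNet(getModel(actor));    % underlying dlnetwork objects
criticNet = freezeNet(getModel(critic));
agent = setActor(agent, setModel(actor, actorNet));
agent = setCritic(agent, setModel(critic, criticNet));

function net = freezeNet(net)
    % Set the learn-rate factor of every learnable parameter to 0
    lrn = net.Learnables;                  % table with Layer, Parameter, Value columns
    for i = 1:height(lrn)
        net = setLearnRateFactor(net, lrn.Layer(i), lrn.Parameter(i), 0);
    end
end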

Answers (1)

Karanjot on 30 Jan 2024
Edited: Karanjot on 30 Jan 2024
Observing fluctuations in rewards is a common occurrence when retraining a reinforcement learning (RL) agent, despite having locked the parameters of both the actor and critic architectures. The agent continues its exploration and learning within the given environment, and the specified reward function significantly influences the reward outcomes.
The variation in rewards can be influenced by several factors, such as the exploration-exploitation trade-off, the complexity of the environment, and the learning rate of the agent. It is possible that the agent is still trying to optimize its policy and may encounter different states or actions that result in varying rewards.
The environment's inherent stochasticity can lead to different state transitions and rewards for similar actions. Additionally, if other unfrozen parameters or noise processes are involved in action selection, they can also contribute to the observed variation.
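One quick check is to simulate the frozen agent instead of retraining it and look at the spread of the episode rewards; if the spread is small under simulation, the variation seen during retraining is most likely coming from exploration noise or environment stochasticity. A minimal sketch, assuming your agent and env variables exist and that simulation runs with the agent's exploration disabled (the default for most agents); the episode count and step limit are illustrative:
simOpts = rlSimulationOptions('MaxSteps', 500, 'NumSimulations', 20);
experiences = sim(env, agent, simOpts);
episodeRewards = arrayfun(@(e) sum(e.Reward.Data), experiences);   % total reward per episode
fprintf('mean = %.3f, std = %.3f\n', mean(episodeRewards), std(episodeRewards));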
You may consider the following steps:
  1. Plot the rewards over time during the retraining process to observe the trend (see the sketch after this list). This can help you understand whether the rewards are converging.
  2. Experiment with different learning rates for the agent. A higher learning rate may lead to faster convergence but could also result in more variation initially.
  3. You can also try modifying the reward function to see if it reduces the variation in rewards.
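As a sketch of step 1, assuming you keep the output of train and that trainOpts is your rlTrainingOptions object (the field names below are those of the training statistics returned by train):
trainingStats = train(agent, env, trainOpts);
figure
plot(trainingStats.EpisodeIndex, trainingStats.EpisodeReward)    % per-episode reward
hold on
plot(trainingStats.EpisodeIndex, trainingStats.AverageReward)    % moving average
xlabel('Episode'), ylabel('Reward')
legend('Episode reward', 'Average reward')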
Keep in mind that RL training is inherently iterative, and achieving an optimal policy often requires multiple iterations. While some degree of reward variation is to be anticipated, it may indicate a need for further investigation or adjustments in your training setup.
