RL Agent External action not properly used in SAC

I am using the external action input of a Simulink RL Agent block at the beginning of training to guide the agent.
With PPO, this was enough for the agent to also learn from those forced external actions.
With SAC, the agent seemed to learn only to output 0 with this setup. I eventually found that adding the last_action input fixed it; PPO appears to handle this internally.
This workaround is sufficient for me, so there is no need for an immediate fix. I just wanted to report the unexpected behavior. The documentation says that the external action is used for learning, so I believe the way it works with PPO is the intended behavior. A rough sketch of my setup is below.
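
For context, here is a minimal sketch of the MATLAB side of the setup. The model name, block path, and signal dimensions are placeholders; the essential parts are that the observation channel is widened so the previously applied action (appended in the Simulink model, e.g. via a Unit Delay on the applied action signal) reaches the agent, and that the external action port is enabled on the RL Agent block.

% Minimal sketch of the guided-SAC setup. Dimensions, model name, and
% block path are hypothetical placeholders.
numObs = 4;   % plant observations (assumed)
numAct = 1;   % continuous action dimension (assumed)

% Observation spec is widened by the action dimension so the previously
% applied action, appended in the Simulink model, reaches the agent.
obsInfo = rlNumericSpec([numObs + numAct, 1]);
actInfo = rlNumericSpec([numAct, 1], "LowerLimit", -1, "UpperLimit", 1);

agent = rlSACAgent(obsInfo, actInfo);   % default SAC agent from the specs

% "myModel/RL Agent" must have its external action inputs enabled in the
% block dialog so the guiding signal can override the agent's output.
env = rlSimulinkEnv("myModel", "myModel/RL Agent", obsInfo, actInfo);

trainOpts = rlTrainingOptions("MaxEpisodes", 500, "MaxStepsPerEpisode", 200);
% trainResults = train(agent, env, trainOpts);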

 Accepted Answer

Your observation about the different behaviors of Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) is interesting and could be valuable for others.
In reinforcement learning, how the agent records and learns from the actions that were actually applied can significantly affect training, and the details differ between algorithms. Here are a few points to consider:
  1. Algorithm-Specific Behaviors: PPO and SAC are fundamentally different in how they approach policy optimization. PPO tries to keep the new policy close to the old policy, while SAC aims for maximum entropy in addition to reward maximization. This difference might influence how they handle external actions and learning.
  2. Learning from External Actions: SAC is off-policy and learns from transitions stored in a replay buffer, so the action recorded at each step must be the one actually applied to the plant. That adding the last_action input improved learning suggests SAC needs this information supplied explicitly, while PPO appears to handle it internally (see the conceptual sketch after this list).
  3. Documentation and Expected Behavior: If the MATLAB documentation indicates that external actions are used for learning, but the behavior differs between algorithms, it could be worth bringing this to the attention of MathWorks through their support or community forums. This feedback could lead to improved documentation or even enhancements in future software updates.
  4. Practical Solutions: Your workaround of adding the last_action input is a practical solution. In complex systems, such approaches are often necessary to achieve desired outcomes, even if they deviate from the expected or documented behavior.
  5. Community Knowledge Sharing: Sharing these kinds of insights, as you've done, is beneficial for the community. It helps others who might face similar challenges and contributes to a collective understanding of these advanced tools.
  6. Further Experimentation and Reporting: Continued experimentation with these algorithms, and reporting any unusual behaviors or discrepancies with expected outcomes, is valuable. Such feedback is often crucial for the continuous improvement of software tools and algorithms.
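
To illustrate point 2, the sketch below is conceptual only, not the Reinforcement Learning Toolbox's internal implementation. It shows why, for an off-policy learner like SAC, the transition stored at each step should contain the action that was actually applied to the plant (the external one during guided training), not the action the policy proposed.

% Conceptual sketch only -- not Reinforcement Learning Toolbox internals.
% An off-policy agent such as SAC learns from replayed transitions, so the
% stored action must be the one actually applied to the plant.
obs = [0.1; -0.3];          % current observation (example values)
agentAction = 0.7;          % what the policy proposed
externalAction = -0.2;      % what the guidance signal forced
useExternal = true;         % guidance phase is active
nextObs = [0.05; -0.25];  reward = 1.0;  isDone = false;

% Record the applied action, whichever source produced it
if useExternal
    appliedAction = externalAction;
else
    appliedAction = agentAction;
end

transition = struct("Observation", obs, "Action", appliedAction, ...
    "Reward", reward, "NextObservation", nextObs, "IsDone", isDone);

If the agent's own output were stored instead, the critic would be trained on actions that never actually drove the plant, which would be consistent with the degenerate zero-output behavior you observed.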

1 Comment

Good point! I will report it as a bug and see what the MATLAB team thinks.



Asked on 11 Jan 2024
Commented on 11 Jan 2024
