Self-Punishment and Reward Backfill for Deep Q-Learning

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2023-10, Vol. 34 (10), p. 8086-8093
Main authors: Bonyadi, Mohammad Reza; Wang, Rui; Ziaei, Maryam
Format: Article
Language: English
Description
Abstract: Reinforcement learning (RL) agents learn by reinforcing behaviors that maximize their total reward, which is usually provided by the environment. In many environments, however, the reward is provided after a series of actions rather than after each single action, leaving the agent uncertain about which of those actions were effective, an issue known as the credit assignment problem. In this brief, we propose two strategies inspired by behavioral psychology that enable the agent to intrinsically estimate more informative reward values for actions that receive no reward. The first strategy, called self-punishment (SP), discourages the agent from making mistakes that lead to undesirable terminal states. The second strategy, called reward backfill (RB), backpropagates the rewards between two rewarded actions. We prove that, under certain assumptions and regardless of the RL algorithm used, these two strategies preserve the ordering of policies in the space of all possible policies in terms of their total reward and, by extension, preserve the optimal policy. Hence, the proposed strategies integrate with any RL algorithm that learns a value or action-value function through experience. We incorporated these two strategies into three popular deep RL approaches and evaluated the results on 30 Atari games. After parameter tuning, our results indicate that the proposed strategies improve the tested methods in over 65% of the tested games, with performance improvements of up to more than 25 times.
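
Taken together, the two strategies amount to a form of intrinsic reward shaping applied to an episode's reward trace. The Python sketch below is only an illustration of that idea under stated assumptions: the helper name shape_rewards, the fixed self-punishment penalty, and the geometric backfill decay are choices made here for concreteness, not the paper's exact formulation.

    import numpy as np

    def shape_rewards(rewards, bad_terminal, sp_penalty=1.0, rb_decay=0.9):
        # Illustrative reward shaping for one finished episode (assumed form).
        # rewards: per-step environment rewards; bad_terminal: whether the
        # episode ended in an undesirable terminal state (e.g. losing the game).
        shaped = np.asarray(rewards, dtype=float).copy()

        # Self-punishment (SP): discourage mistakes by attaching a negative
        # reward to an undesirable terminal state.
        if bad_terminal:
            shaped[-1] -= sp_penalty

        # Reward backfill (RB): sweep backwards so that the zero-reward steps
        # between two rewarded actions receive a decayed share of the upcoming
        # reward, giving intermediate actions a more informative signal.
        carry = 0.0
        for t in range(len(shaped) - 1, -1, -1):
            if shaped[t] != 0.0:
                carry = shaped[t]        # restart the backfill at a rewarded step
            else:
                carry *= rb_decay        # decay with distance from that reward
                shaped[t] = carry
        return shaped

    # Example: the sparse reward at step 3 is spread back over earlier steps,
    # and the undesirable terminal state at the end is punished.
    print(shape_rewards([0, 0, 0, 1, 0, 0, 0], bad_terminal=True))

In such a sketch the shaped trace would replace the raw rewards before the value or action-value updates, which is consistent with the abstract's statement that the strategies integrate with any RL algorithm that learns such functions from experience.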
ISSN: 2162-237X, 2162-2388
DOI: 10.1109/TNNLS.2021.3140042