Partial Consistency for Stabilizing Undiscounted Reinforcement Learning

Detailed Description

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2023-12, Vol. 34 (12), p. 10359-10373
Authors: Gao, Haichuan; Yang, Zhile; Tan, Tian; Zhang, Tianren; Ren, Jinsheng; Sun, Pengfei; Guo, Shangqi; Chen, Feng
Format: Article
Language: English
Subjects:
Online Access: Full text
Description
Abstract: Undiscounted return is an important setup in reinforcement learning (RL) and characterizes many real-world problems. However, optimizing an undiscounted return often causes training instability. The causes of this instability have not been analyzed in depth by existing studies. In this article, the problem is analyzed from the perspective of value estimation. The analysis indicates that the instability originates from transient traps caused by inconsistently selected actions. However, always selecting one consistent action in the same state limits exploration. To balance exploration effectiveness and training stability, a novel sampling method called last-visit sampling (LVS) is proposed to ensure that a subset of actions is selected consistently in the same state. The LVS method decomposes the state-action value into two parts, i.e., the last-visit (LV) value and the revisit value. This decomposition ensures that the LV value is determined by consistently selected actions. We prove that the LVS method can eliminate transient traps while preserving optimality. We also empirically show that the method can stabilize the training processes of five typical tasks, including vision-based navigation and manipulation tasks.
ISSN:2162-237X
2162-2388
DOI:10.1109/TNNLS.2022.3165941