Trajectory Based Prioritized Double Experience Buffer for Sample-Efficient Policy Optimization

Bibliographic Details
Published in:	IEEE Access 2021, Vol. 9, p. 101424-101432
Main Authors:	Li, Shengxiang; Li, Ou; Liu, Guangyi; Ding, Siyuan; Bai, Yijie
Format:	Article
Language:	English
Subjects:	Approximation; Buffers; Computer Science; Computer Science, Information Systems; distributed RL; Engineering; Engineering, Electrical & Electronic; Games; Gradient methods; Learning; Linear programming; Optimization; policy gradient; Reinforcement learning; replay buffer; Science & Technology; Technology; Telecommunications; Training; Trajectory
Online Access:	Full text

Description
Summary:	Reinforcement learning has recently made great progress in various challenging domains, such as the board game Go and the real-time strategy game StarCraft II. Policy-gradient-based reinforcement learning methods have become mainstream owing to their effectiveness and simplicity in both discrete and continuous scenarios. However, policy gradient methods commonly involve function approximation and work in an on-policy fashion, which leads to high variance and low sample efficiency. This paper introduces a novel policy gradient method that improves sample efficiency via a pair of trajectory-based prioritized replay buffers and reduces the variance in training with a target network whose weights are updated in a "soft" manner. We evaluate our method on the OpenAI Gym suite of reinforcement learning tasks, and the results show that the proposed method can learn more steadily and achieve higher performance than existing methods.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2021.3097357
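
Note on the "soft" target update mentioned in the summary: the record does not give the update rule, so the following is only a minimal sketch of the commonly used form (Polyak averaging), in which each target weight moves a small fraction tau toward the corresponding online weight. The function name `soft_update`, the plain-list weight representation, and the value of `tau` are assumptions made for illustration and are not taken from the paper.

```python
def soft_update(target, online, tau=0.005):
    """Return target weights moved a fraction tau toward the online weights.

    Assumed rule (standard Polyak averaging, not confirmed by the record):
        target <- (1 - tau) * target + tau * online
    """
    if len(target) != len(online):
        raise ValueError("weight lists must have the same length")
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


# Hypothetical weights; a large tau is used only to make the effect visible.
target = [0.0, 1.0]
online = [1.0, 3.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
print(target)  # [0.875, 2.75]
```

With a small tau the target network changes slowly, which is the usual reason this kind of update is used to stabilize training.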