ADER: Adapting between Exploration and Robustness for Actor-Critic Methods
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Online access: | Order full text |
Abstract: | Combining off-policy reinforcement learning methods with function
approximators such as neural networks has been found to lead to overestimation
of the value function and sub-optimal solutions. Improvements such as TD3 have
been proposed to address this issue. Surprisingly, however, we find that TD3's
performance lags behind that of vanilla actor-critic methods such as DDPG in
some simple environments. In this paper, we show that these failures can be
attributed to insufficient exploration. We identify the cause of insufficient
exploration in TD3 and propose a novel algorithm that ADapts between
Exploration and Robustness, namely ADER. To enhance exploration while
eliminating the overestimation bias, we introduce a dynamic penalty term in
value estimation, computed from the estimated uncertainty, which accounts for
the different compositions of uncertainty at different learning stages.
Experiments in several challenging environments demonstrate the superiority of
the proposed method on continuous control tasks. |
DOI: | 10.48550/arxiv.2109.03443 |
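
The abstract sketches the core idea only at a high level: bootstrapped value targets are penalized by an estimated uncertainty, and the weight of that penalty changes across learning stages to trade exploration against robustness. The record does not give the paper's actual update rule, so the snippet below is a minimal illustrative sketch under assumptions: it uses the standard deviation of a critic ensemble as the uncertainty proxy and a simple linear schedule for the penalty weight, and all function names and constants are hypothetical, not ADER's algorithm.

```python
import numpy as np

def uncertainty_penalized_target(q_ensemble, reward, done, gamma=0.99, beta=0.5):
    """Hypothetical TD target: penalize the bootstrapped value by the
    ensemble's spread, scaled by a stage-dependent weight beta.

    q_ensemble: array of shape (n_critics,), each target critic's
                estimate Q_i(s', a') for the next state-action pair.
    beta:       penalty weight; larger beta is more pessimistic (robust),
                smaller beta is more optimistic (exploratory).
    """
    q_mean = q_ensemble.mean()   # ensemble mean as the value estimate
    q_std = q_ensemble.std()     # ensemble spread as an uncertainty proxy
    penalized_value = q_mean - beta * q_std
    return reward + gamma * (1.0 - done) * penalized_value


def beta_schedule(step, total_steps, beta_start=0.2, beta_end=1.0):
    """Hypothetical schedule: keep the penalty small early in training,
    when uncertainty is dominated by epistemic error and exploration is
    most valuable, then increase it to guard against overestimation."""
    frac = min(step / total_steps, 1.0)
    return beta_start + frac * (beta_end - beta_start)


# Example usage with made-up numbers for three target critics.
q_next = np.array([10.2, 11.0, 9.7])
beta_t = beta_schedule(step=50_000, total_steps=1_000_000)
target = uncertainty_penalized_target(q_next, reward=1.0, done=0.0, beta=beta_t)
print(beta_t, target)
```

The design choice the sketch tries to convey is the one named in the abstract: a single penalty coefficient is not fixed, but adapted over training so that the same uncertainty estimate can serve exploration early on and robustness later.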