Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Despite the close connection between exploration and sample efficiency, most state-of-the-art reinforcement learning algorithms include no mechanism for exploration beyond maximizing the entropy of the policy. In this work we address this apparent missed opportunity. We observe that the most common formulation of directed exploration in deep RL, known as bonus-based exploration (BBE), suffers from bias and slow coverage in the few-sample regime, which makes BBE actively detrimental to policy learning in many control tasks. We show that by decoupling the task policy from the exploration policy, directed exploration can be highly effective for sample-efficient continuous control. Our method, Decoupled Exploration and Exploitation Policies (DEEP), can be combined with any off-policy RL algorithm without modification. When used in conjunction with soft actor-critic, DEEP incurs no performance penalty in dense-reward environments; in sparse-reward environments, it yields a several-fold improvement in data efficiency through better exploration. |
DOI: | 10.48550/arxiv.2101.09458 |
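
To make the abstract's core idea concrete, here is a minimal Python sketch of decoupling a data-collecting exploration policy from an off-policy task policy that share one replay buffer. It is not the authors' implementation: the environment interface, the `act`/`update` methods, and the count-based bonus are assumptions chosen purely for illustration.

```python
# Minimal sketch of the decoupling idea described in the abstract. This is NOT
# the paper's code: the environment interface (reset/step returning
# (state, reward, done)), the act/update methods, and the count-based bonus are
# illustrative assumptions.

import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Shared buffer: filled by the exploration policy, read by both learners."""

    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.data), min(batch_size, len(self.data)))


class CountBonus:
    """Toy count-based novelty bonus over a discretized state space."""

    def __init__(self, bins=20):
        self.bins = bins
        self.counts = {}

    def __call__(self, state):
        key = tuple(np.floor(np.asarray(state) * self.bins).astype(int).tolist())
        self.counts[key] = self.counts.get(key, 0) + 1
        return 1.0 / np.sqrt(self.counts[key])


def train_decoupled(env, explore_policy, task_policy, steps=10_000, batch_size=256):
    """Exploration policy collects all data; task policy learns purely off-policy."""
    buffer = ReplayBuffer()
    bonus = CountBonus()
    state = env.reset()

    for _ in range(steps):
        # Only the exploration policy ever acts in the environment.
        action = explore_policy.act(state)
        next_state, reward, done = env.step(action)

        # Store the task reward and the intrinsic bonus separately so each
        # learner can optimize its own objective from the same transitions.
        buffer.add((state, action, reward, bonus(next_state), next_state, done))
        state = env.reset() if done else next_state

        if len(buffer.data) >= batch_size:
            batch = buffer.sample(batch_size)
            explore_policy.update(batch, reward_index=3)  # trained on the bonus
            task_policy.update(batch, reward_index=2)     # trained on task reward only

    return task_policy
```

Because the task learner only ever consumes replayed transitions, it can be any off-policy algorithm (the abstract pairs DEEP with soft actor-critic) and its objective is never distorted by the exploration bonus, which is the source of bias the abstract attributes to bonus-based exploration.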