Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Reward-free reinforcement learning (RF-RL), a recently introduced RL
paradigm, relies on random action-taking to explore the unknown environment
without any reward feedback. While the primary goal of the exploration phase in
RF-RL is to reduce the uncertainty in the estimated model with a minimum number
of trajectories, in practice the agent often needs to abide by certain safety
constraints at the same time. It remains unclear how such a safe exploration
requirement would affect the corresponding sample complexity needed to achieve
the desired optimality of the obtained policy in planning. In this work, we make
a first attempt to answer this question. In particular, we consider the scenario
where a safe baseline policy is known beforehand, and propose a unified Safe
reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET
framework to the tabular and the low-rank MDP settings, and develop algorithms
coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage
the concavity and continuity of the newly introduced truncated value functions,
and are guaranteed to achieve zero constraint violation during exploration with
high probability. Furthermore, both algorithms can provably find a near-optimal
policy subject to any constraint in the planning phase. Remarkably, the sample
complexities of both algorithms match or even outperform the state of the art in
their constraint-free counterparts up to constant factors, proving that the
safety constraint hardly increases the sample complexity of RF-RL. |
DOI: | 10.48550/arxiv.2206.14057 |
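As a rough, hedged illustration of two ideas named in the abstract (value functions truncated at a threshold, and exploration that stays safe by falling back toward a known safe baseline policy), the Python sketch below evaluates a clipped finite-horizon value function in a tabular MDP and selects the largest mixing weight on an exploratory policy whose estimated constraint cost stays within a budget. The function names, the truncation threshold `tau`, the `budget`, and the policy-mixing scheme are illustrative assumptions, not the paper's exact SWEET construction.

```python
# Minimal illustrative sketch (NOT the paper's SWEET algorithm): tabular,
# finite-horizon policy evaluation with a truncated value function, plus a
# simple search for the largest mixture with a known safe baseline policy
# whose estimated constraint cost stays within a budget.
import numpy as np

def evaluate_truncated(P, r, pi, H, tau):
    """Finite-horizon policy evaluation with per-step value truncation.

    P  : (S, A, S) transition kernel (here, the estimated model)
    r  : (S, A) reward or constraint cost in [0, 1]
    pi : (H, S, A) policy; pi[h, s] is a distribution over actions
    H  : horizon
    tau: truncation threshold applied to the value at every step
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = r + P @ V                      # (S, A): one-step lookahead
        V = np.einsum("sa,sa->s", pi[h], Q)  # expected Q under pi[h]
        V = np.minimum(V, tau)             # truncation keeps the value bounded
    return V

def safe_mixture(P, c, pi_explore, pi_safe, H, budget, s0):
    """Largest step-wise mixing weight on the exploratory policy such that
    the mixed policy's estimated constraint value from state s0 is within
    the budget; falls back to the safe baseline if none qualifies."""
    for alpha in np.linspace(1.0, 0.0, 21):
        pi_mix = alpha * pi_explore + (1.0 - alpha) * pi_safe
        cost = evaluate_truncated(P, c, pi_mix, H, tau=np.inf)[s0]
        if cost <= budget:
            return alpha, pi_mix
    return 0.0, pi_safe
```

In this hypothetical setup, the constraint check is performed against the estimated model `P`, so it only illustrates the shape of the safety test; the paper's guarantees (zero constraint violation with high probability, and sample complexity matching the constraint-free state of the art) rest on its specific truncated value functions and confidence arguments, which are not reproduced here.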