Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | This paper investigates conservative exploration in reinforcement learning
where the performance of the learning agent is guaranteed to be above a certain
threshold throughout the learning process. It focuses on the tabular episodic
Markov Decision Process (MDP) setting that has finite states and actions. With
the knowledge of an existing safe baseline policy, an algorithm termed StepMix
is proposed to balance exploitation and exploration while ensuring that, with
high probability, the conservative constraint is not violated in any episode.
StepMix features a unique design of a mixture policy that
adaptively and smoothly interpolates between the baseline policy and the
optimistic policy. Theoretical analysis shows that StepMix achieves
near-optimal regret order as in the constraint-free setting, indicating that
obeying the stringent episode-wise conservative constraint does not compromise
the learning performance. Besides, a randomization-based EpsMix algorithm is
also proposed and shown to achieve the same performance as StepMix. The
algorithm design and theoretical analysis are further extended to the setting
where the baseline policy is not given a priori but must be learned from an
offline dataset, and it is proved that a similar conservative guarantee and
regret can be achieved if the offline dataset is sufficiently large. Experimental
results corroborate the theoretical analysis and demonstrate the effectiveness
of the proposed conservative exploration strategies. |
---|---|
DOI: | 10.48550/arxiv.2306.06265 |
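
The summary describes mixing a safe baseline policy with an optimistic policy so that a per-episode performance threshold holds. The sketch below is a generic illustration of that idea only, not the paper's StepMix or EpsMix algorithm. It assumes the constraint takes the form "expected return at least (1 - gamma) times the baseline value", that the baseline value `v_base` is known, and that `v_opt_lower` is a valid lower bound on the optimistic policy's value. The names `mixing_weight`, `pick_policy`, `gamma`, `v_base`, and `v_opt_lower` are all hypothetical.

```python
import random


def mixing_weight(v_base: float, v_opt_lower: float, gamma: float) -> float:
    """Largest rho in [0, 1] such that
    (1 - rho) * v_base + rho * v_opt_lower >= (1 - gamma) * v_base.

    Assumes v_base >= 0 and 0 <= gamma <= 1 (illustrative assumptions).
    """
    threshold = (1.0 - gamma) * v_base
    if v_opt_lower >= threshold:
        # The optimistic policy already satisfies the constraint on its own.
        return 1.0
    # Here v_base - v_opt_lower > gamma * v_base >= 0, so the ratio is in [0, 1).
    return gamma * v_base / (v_base - v_opt_lower)


def pick_policy(rho: float, rng: random.Random) -> str:
    """Play the optimistic policy with probability rho, else the baseline."""
    return "optimistic" if rng.random() < rho else "baseline"


# Example: baseline value 10, pessimistic estimate 4 for the optimistic policy,
# and a tolerated loss of 10% relative to the baseline.
rho = mixing_weight(v_base=10.0, v_opt_lower=4.0, gamma=0.1)
mixture_lower_bound = (1.0 - rho) * 10.0 + rho * 4.0
choice = pick_policy(rho, random.Random(0))
```

Under these assumptions the mixture's expected value is a convex combination of the two policies' values, so picking the largest feasible `rho` explores as much as the constraint allows. As the lower bound on the optimistic policy improves, `rho` grows toward 1.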