Optimistic PAC Reinforcement Learning: the Instance-Dependent View
Main authors: | , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Abstract: | Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new "target trick" of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime. |
---|---|
DOI: | 10.48550/arxiv.2207.05852 |
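
As background for the abstract above, the following is a minimal, self-contained Python sketch of the generic optimism-based best-policy-identification loop that algorithms such as BPI-UCRL instantiate: explore with the policy that is greedy with respect to upper confidence bounds, and stop once the gap between the optimistic value and the pessimistic value of the recommended policy falls below epsilon. The toy environment, the bonus shape, the helper `play_episode`, and all constants below are illustrative assumptions, not the paper's actual algorithm, bonuses, or analysis.

```python
# Illustrative sketch of an optimism-based PAC-RL (best-policy identification)
# loop in an episodic tabular MDP. Placeholder environment, bonus, and constants;
# NOT the paper's BPI-UCRL implementation, only the generic pattern it follows.
import numpy as np

rng = np.random.default_rng(0)

# A tiny random episodic MDP (hypothetical stand-in environment).
S, A, H = 4, 2, 5                                  # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))         # true transitions P[s, a] -> dist over states
R = rng.uniform(size=(S, A))                       # true mean rewards in [0, 1]

def play_episode(policy):
    """Roll out one episode following a (stage, state) -> action table; return (s, a, s') triples."""
    s, traj = 0, []
    for h in range(H):
        a = policy[h, s]
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, s_next))
        s = s_next
    return traj

eps, delta = 1.0, 0.1
counts = np.full((S, A, S), 1e-6)                  # transition pseudo-counts (avoid division by zero)
reward_sum = np.zeros((S, A))

for episode in range(1, 20001):
    n = counts.sum(axis=2)                         # visit counts n(s, a)
    P_hat = counts / n[:, :, None]                 # empirical transition model
    R_hat = reward_sum / np.maximum(n, 1.0)        # empirical mean rewards
    bonus = np.sqrt(2 * np.log(S * A * H * episode / delta) / n)  # illustrative exploration bonus

    # Backward induction for optimistic (upper) and pessimistic (lower) value bounds.
    V_up, V_lo = np.zeros(S), np.zeros(S)
    pi_explore = np.zeros((H, S), dtype=int)
    pi_recommend = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q_up = np.clip(R_hat + bonus + P_hat @ V_up, 0.0, H)
        Q_lo = np.clip(R_hat - bonus + P_hat @ V_lo, 0.0, H)
        pi_explore[h] = Q_up.argmax(axis=1)        # greedy w.r.t. upper bounds (exploration)
        pi_recommend[h] = Q_lo.argmax(axis=1)      # greedy w.r.t. lower bounds (recommendation)
        V_up, V_lo = Q_up.max(axis=1), Q_lo.max(axis=1)

    # Stopping rule: the bounds themselves certify an eps-optimal recommendation at the start state.
    certified_gap = V_up[0] - V_lo[0]
    if certified_gap <= eps:
        print(f"stopped after {episode} episodes, certified gap {certified_gap:.3f}")
        break

    # Otherwise, collect one more episode with the optimistic policy and update statistics.
    for (s, a, s_next) in play_episode(pi_explore):
        counts[s, a, s_next] += 1
        reward_sum[s, a] += R[s, a] + 0.1 * rng.standard_normal()  # noisy reward observation
else:
    print(f"budget exhausted, certified gap {V_up[0] - V_lo[0]:.3f}")
```

The certify-then-stop structure is the standard PAC-RL pattern: the recommended policy is only returned once the confidence bounds themselves prove it is epsilon-optimal with the desired probability.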