Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes
Main authors: , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses. SCAL$^+$ is a variant of SCAL (Fruit et al., 2018) that performs efficient exploration-exploitation in any unknown weakly-communicating MDP for which an upper bound $C$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $\Gamma \leq S$ possible next states, we prove that SCAL$^+$ achieves the same theoretical guarantees as SCAL (i.e., a high-probability regret bound of $\widetilde{O}(C\sqrt{\Gamma SAT})$), with a much smaller computational complexity. Similarly, C-SCAL$^+$ exploits an exploration bonus to achieve sublinear regret in any undiscounted MDP with continuous state space. We show that C-SCAL$^+$ achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm with regret guarantees in this setting. While optimistic algorithms such as UCRL, SCAL or UCCRL maintain a high-confidence set of plausible MDPs around the true unknown MDP, SCAL$^+$ and C-SCAL$^+$ leverage an exploration bonus to directly plan on the empirically estimated MDP, thus being more computationally efficient.
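The abstract's key contrast is between optimistic algorithms, which plan over a whole confidence set of MDPs, and bonus-based algorithms, which plan on the single empirical MDP after inflating its rewards. The Python sketch below illustrates that general idea only; it is not the authors' algorithm. The bonus form, the constants, the plain relative value iteration, the naive span clip, and the function name/signature are all assumptions standing in for SCAL$^+$'s actual span-constrained planning operator.

```python
import numpy as np

def bonus_value_iteration(N_sa, R_hat, P_hat, c_span, delta=0.05, tol=1e-6):
    """Plan on the empirical MDP with rewards inflated by an exploration bonus.

    N_sa  : (S, A) visit counts
    R_hat : (S, A) empirical mean rewards, assumed in [0, 1]
    P_hat : (S, A, S) empirical transition probabilities
    c_span: known upper bound C on the span of the optimal bias function

    The bonus form below is a simplified assumption, not SCAL+'s exact term.
    """
    S, A = R_hat.shape
    bonus = c_span * np.sqrt(np.log(1.0 / delta) / np.maximum(N_sa, 1))
    r_opt = R_hat + bonus  # optimistic reward = empirical mean + bonus

    # Relative value iteration for undiscounted (average-reward) planning.
    h = np.zeros(S)
    for _ in range(10_000):
        q = r_opt + np.einsum("sat,t->sa", P_hat, h)
        h_new = q.max(axis=1)
        # Crude stand-in for SCAL's span-constrained operator: clip the
        # bias estimate so that its span never exceeds c_span.
        h_new = np.minimum(h_new, h_new.min() + c_span)
        h_new = h_new - h_new[0]  # pin a reference state
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return q.argmax(axis=1), h  # greedy policy and bias estimate
```

Note that only one MDP (the empirical one) is ever planned on here, which is the source of the computational savings the abstract claims over confidence-set methods such as UCRL, SCAL, or UCCRL.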
DOI: 10.48550/arxiv.1812.04363