Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning
| Main authors: | , , , |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Order full text |
| Abstract: | We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov decision process (MDP) for which an upper bound $c$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $\Gamma \leq S$ possible next states, we prove a regret bound of $\widetilde{O}(c\sqrt{\Gamma SAT})$, which significantly improves over existing algorithms (e.g., UCRL and PSRL), whose regret scales linearly with the MDP diameter $D$. In fact, the optimal bias span is finite and often much smaller than $D$ (e.g., $D=\infty$ in non-communicating MDPs). A similar result was originally derived by Bartlett and Tewari (2009) for REGAL.C, for which no tractable algorithm is available. In this paper, we relax the optimization problem at the core of REGAL.C, we carefully analyze its properties, and we provide the first computationally efficient algorithm to solve it. Finally, we report numerical simulations supporting our theoretical findings and showing how SCAL significantly outperforms UCRL in MDPs with large diameter and small span. |
| DOI: | 10.48550/arxiv.1802.04020 |
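The core computational idea the abstract refers to can be sketched compactly: plan with value iteration while forcing every iterate to have span at most the known bound $c$. The snippet below is a minimal, illustrative Python sketch of such span-truncated relative value iteration on a known MDP; the function name and interface are our own, and the paper's actual planning operator (run on an optimistic extended MDP, with additional care in the truncation and stopping rules) is more involved.

```python
import numpy as np

def span(v):
    """Span seminorm: max(v) - min(v)."""
    return float(v.max() - v.min())

def span_constrained_vi(P, r, c, tol=1e-6, max_iter=100_000):
    """Illustrative span-truncated relative value iteration.

    P : (S, A, S) transition probabilities, r : (S, A) rewards,
    c : assumed upper bound on the span of the optimal bias function.
    Returns an approximate gain and (renormalized) bias vector.
    NOTE: this is a sketch of the span-constraint mechanism only; in
    practice an aperiodicity transformation may be needed to guarantee
    convergence, and SCAL applies this kind of operator to an
    optimistic extended MDP rather than to known (P, r).
    """
    S, A, _ = P.shape
    v = np.zeros(S)
    tv = v
    for _ in range(max_iter):
        q = r + P @ v                   # Bellman backup: r(s,a) + sum_s' P(s'|s,a) v(s')
        tv = q.max(axis=1)
        # Key step: clip the iterate so its span never exceeds c.
        tv = np.minimum(tv, tv.min() + c)
        if span(tv - v) < tol:          # relative-VI stopping rule
            break
        v = tv - tv.min()               # renormalize to keep iterates bounded
    diff = tv - v
    gain = 0.5 * (diff.max() + diff.min())
    return gain, v

# Toy 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.2, 0.8], [0.7, 0.3]]])
r = np.array([[0.0, 0.5], [1.0, 0.2]])
gain, bias = span_constrained_vi(P, r, c=2.0)
```

Without the clipping line this reduces to ordinary relative value iteration; with it, the returned policy is constrained to have bias span at most $c$, which is what lets the regret bound scale with $c$ instead of the diameter $D$.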