Navigating to the Best Policy in Markov Decision Processes
Saved in:
Main Authors:
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Abstract: We investigate the classical active pure exploration problem in Markov
Decision Processes, where the agent sequentially selects actions and, from the
resulting system trajectory, aims at identifying the best policy as fast as
possible. We propose a problem-dependent lower bound on the average number of
steps required before a correct answer can be given with probability at least
$1-\delta$. We further provide the first algorithm with an instance-specific
sample complexity in this setting. This algorithm addresses the general case of
communicating MDPs; we also propose a variant with a reduced exploration rate
(and hence faster convergence) under an additional ergodicity assumption. This
work extends previous results relative to the \emph{generative
setting}~\cite{pmlr-v139-marjani21a}, where the agent could at each step query
the random outcome of any (state, action) pair. In contrast, we show here how
to deal with the \emph{navigation constraints}, induced by the \emph{online
setting}. Our analysis relies on an ergodic theorem for non-homogeneous Markov
chains which we consider of wide interest in the analysis of Markov Decision
Processes.
DOI: 10.48550/arxiv.2106.02847
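For intuition only, below is a minimal sketch of the online, navigation-constrained setting described in the abstract: the agent gathers data along a single trajectory of the MDP rather than querying arbitrary (state, action) pairs as in the generative setting. The epsilon-uniform sampling rule, the naive stopping criterion, and all constants are hypothetical illustrations and are not the algorithm proposed in the paper.

```python
import numpy as np

# Sketch of delta-PAC-style best-policy identification under navigation constraints:
# the agent moves through the chain, estimates the model along its trajectory,
# and recommends the empirically optimal policy when it stops.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9

# A small random MDP instance (unknown to the agent).
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
R = rng.uniform(size=(S, A))                 # mean rewards

def greedy_policy(P_hat, R_hat, n_iter=200):
    """Value iteration on the empirical model; returns the greedy policy."""
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = R_hat + gamma * P_hat @ V        # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

# Empirical statistics built along the single observed trajectory.
counts = np.zeros((S, A))
trans = np.zeros((S, A, S))
rew_sum = np.zeros((S, A))

state, eps = 0, 0.2                          # eps: forced-exploration rate (hypothetical)
pi_hat = np.zeros(S, dtype=int)
for t in range(1, 20001):
    if t % 100 == 1:
        # Periodically refresh the empirical model and the greedy policy.
        P_hat = trans / np.maximum(counts, 1)[..., None]
        P_hat[counts == 0] = 1.0 / S         # uniform guess for unseen (state, action) pairs
        R_hat = rew_sum / np.maximum(counts, 1)
        pi_hat = greedy_policy(P_hat, R_hat)

    # Sampling rule: mostly follow the empirically greedy policy, sometimes explore;
    # either way, the agent only learns about the state it currently occupies.
    action = rng.integers(A) if rng.random() < eps else pi_hat[state]

    # Environment step: reward and next state drawn from the true MDP.
    reward = R[state, action] + rng.normal(scale=0.1)
    next_state = rng.choice(S, p=P[state, action])
    counts[state, action] += 1
    trans[state, action, next_state] += 1
    rew_sum[state, action] += reward
    state = next_state

    # Naive stopping rule, a placeholder for a genuine delta-PAC stopping criterion:
    # stop once every (state, action) pair has been visited often enough.
    if counts.min() >= 200:
        break

print(f"stopped after {t} steps; recommended policy: {greedy_policy(P_hat, R_hat)}")
```

The sketch highlights why navigation constraints matter: how often a given (state, action) pair can be sampled depends on how quickly the trajectory reaches that state, which is exactly the difficulty the paper's lower bound and algorithms account for.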