Efficient Reinforcement Learning in Probabilistic Reward Machines
Format: | Article |
Language: | English |
Abstract: | In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of $\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})$, where $H$ is the time horizon, $O$ is the number of observations, $A$ is the number of actions, and $T$ is the number of time-steps. This result improves over the best-known bound, $\widetilde{O}(H\sqrt{OAT})$, of \citet{pmlr-v206-bourel23a} for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When $T \geq H^3O^3A^2$ and $OA \geq H$, our regret bound yields a regret of $\widetilde{O}(\sqrt{HOAT})$, which matches the established lower bound of $\Omega(\sqrt{HOAT})$ for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experimental evaluations that our algorithm indeed outperforms prior methods in various PRM environments. |
DOI: | 10.48550/arxiv.2408.10381 |
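As a brief sanity check of the regime stated in the abstract (a sketch using only the quantities defined there), the two conditions are exactly what make the leading term dominate the other two:

$$H^2O^2A^{3/2} \leq \sqrt{HOAT} \iff H^4O^4A^3 \leq HOAT \iff T \geq H^3O^3A^2,$$
$$H\sqrt{T} \leq \sqrt{HOAT} \iff H^2T \leq HOAT \iff H \leq OA,$$

so under both conditions the overall bound collapses to $\widetilde{O}(\sqrt{HOAT})$, matching the $\Omega(\sqrt{HOAT})$ lower bound up to logarithmic factors.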