Efficient Reinforcement Learning in Probabilistic Reward Machines
Published in: arXiv.org, 2024-08
Main Authors:
Format: Article
Language: English
Online Access: Full text
Abstract: In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of \(\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})\), where \(H\) is the time horizon, \(O\) is the number of observations, \(A\) is the number of actions, and \(T\) is the number of time-steps. This result improves over the best-known bound, \(\widetilde{O}(H\sqrt{OAT})\), of \citet{pmlr-v206-bourel23a} for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When \(T \geq H^3O^3A^2\) and \(OA \geq H\), our regret bound leads to a regret of \(\widetilde{O}(\sqrt{HOAT})\), which matches the established lower bound of \(\Omega(\sqrt{HOAT})\) for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experimental evaluations that our algorithm outperforms prior methods in various PRM environments.
ISSN: 2331-8422
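
A quick sketch (added here for readability, not part of the record): the two conditions quoted in the abstract are exactly those under which the leading term dominates the stated bound \(\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})\). Squaring each comparison gives

\[
\sqrt{HOAT} \;\ge\; H^2 O^2 A^{3/2} \iff HOAT \ge H^4 O^4 A^3 \iff T \ge H^3 O^3 A^2,
\qquad
\sqrt{HOAT} \;\ge\; H\sqrt{T} \iff HOAT \ge H^2 T \iff OA \ge H,
\]

so under both conditions the regret simplifies to \(\widetilde{O}(\sqrt{HOAT})\), matching the \(\Omega(\sqrt{HOAT})\) lower bound up to logarithmic factors.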