Efficient Reinforcement Learning in Probabilistic Reward Machines

Bibliographic Details
Published in: arXiv.org 2024-08
Main Authors: Lin, Xiaofeng; Zhang, Xuezhou
Format: Article
Language: English
Subjects:
Online Access: Full text
Description
Summary: In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of \(\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})\), where \(H\) is the time horizon, \(O\) is the number of observations, \(A\) is the number of actions, and \(T\) is the number of time-steps. This result improves over the best-known bound, \(\widetilde{O}(H\sqrt{OAT})\), of \citet{pmlr-v206-bourel23a} for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When \(T \geq H^3O^3A^2\) and \(OA \geq H\), our regret bound simplifies to \(\widetilde{O}(\sqrt{HOAT})\), which matches the established lower bound of \(\Omega(\sqrt{HOAT})\) for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experimental evaluation that our algorithm outperforms prior methods in various PRM environments.
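A quick check (a sketch based only on the bound quoted above, not part of the original record) shows why the stated conditions make the first term dominant: each lower-order term is at most \(\sqrt{HOAT}\) exactly in the claimed regime,
\[
H\sqrt{T} \le \sqrt{HOAT} \;\iff\; H^2 T \le HOAT \;\iff\; H \le OA,
\]
\[
H^2 O^2 A^{3/2} \le \sqrt{HOAT} \;\iff\; H^4 O^4 A^3 \le HOAT \;\iff\; T \ge H^3 O^3 A^2,
\]
so under \(OA \ge H\) and \(T \ge H^3O^3A^2\) the overall regret is \(\widetilde{O}(\sqrt{HOAT})\), matching the conditions stated in the summary.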
ISSN:2331-8422