Posterior Sampling for Continuing Environments
Published in: | arXiv.org 2024-08 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected \(\gamma\)-discounted return in that model. At each time, with probability \(1-\gamma\), the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon \(T\), we establish an \(\tilde{O}(\tau S \sqrt{A T})\) bound on the Bayesian regret, where \(S\) is the number of environment states, \(A\) is the number of actions, and \(\tau\) denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration. |
ISSN: | 2331-8422 |
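The abstract describes the core mechanism of continuing PSRL precisely enough to sketch: the agent maintains a posterior over environments, and at each time step, with probability \(1-\gamma\), it discards its working model, draws a fresh sample from the posterior, and re-plans a \(\gamma\)-discounted optimal policy for that sample. The sketch below is illustrative only and not the authors' implementation; the tabular setting, the Dirichlet/Beta conjugate priors, the value-iteration planner, and the stand-in `true_env_step` simulator are all assumptions introduced here.

```python
# Minimal sketch of the resampling scheme described in the abstract, for an
# assumed small tabular environment. Not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

S, A = 5, 2          # number of states and actions (assumed tabular setting)
gamma = 0.99         # discount factor; resampling occurs w.p. 1 - gamma each step
T = 10_000           # interaction horizon

# Posterior statistics: Dirichlet counts for transitions, Beta counts for Bernoulli rewards.
trans_counts = np.ones((S, A, S))     # Dirichlet(1, ..., 1) prior
rew_counts = np.ones((S, A, 2))       # Beta(1, 1) prior over mean reward in [0, 1]

def sample_model():
    """Draw one statistically plausible environment from the current posterior."""
    P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)] for s in range(S)])
    R = rng.beta(rew_counts[..., 0], rew_counts[..., 1])
    return P, R

def greedy_policy(P, R, iters=500):
    """Policy maximizing gamma-discounted return in the sampled model (value iteration)."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V   # Q has shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def true_env_step(s, a):
    """Stand-in for the unknown environment; replace with a real simulator."""
    s_next = rng.choice(S)                        # hypothetical uniform dynamics
    r = rng.binomial(1, 0.5 if a == 0 else 0.6)   # hypothetical Bernoulli rewards
    return s_next, r

s = 0
P_hat, R_hat = sample_model()
policy = greedy_policy(P_hat, R_hat)
for t in range(T):
    # With probability 1 - gamma, discard the working model and resample from the posterior.
    if rng.random() < 1.0 - gamma:
        P_hat, R_hat = sample_model()
        policy = greedy_policy(P_hat, R_hat)
    a = policy[s]
    s_next, r = true_env_step(s, a)
    # Conjugate posterior updates from the observed transition and reward.
    trans_counts[s, a, s_next] += 1
    rew_counts[s, a, 0] += r
    rew_counts[s, a, 1] += 1 - r
    s = s_next
```

Resampling with probability \(1-\gamma\) per step makes the model-holding epochs geometrically distributed with mean \(1/(1-\gamma)\), so the length of time a sampled model is followed matches the effective planning horizon of \(\gamma\)-discounting.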