Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning
Main authors: , , ,
Format: Article
Language: eng
Online access: Order full text
Abstract:
Thompson sampling (TS) is widely used in sequential decision making due to
its ease of use and appealing empirical performance. However, many existing
analytical and empirical results for TS rely on restrictive assumptions on
reward distributions, such as belonging to conjugate families, which limits
their applicability in realistic scenarios. Moreover, sequential decision
making problems are often carried out in a batched manner, either due to the
inherent nature of the problem or to serve the purpose of reducing
communication and computation costs. In this work, we jointly study these
problems in two popular settings, namely, stochastic multi-armed bandits (MABs)
and infinite-horizon reinforcement learning (RL), where TS is used to learn the
unknown reward distributions and transition dynamics, respectively. We propose
batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC
methods to sample from approximate posteriors with only logarithmic
communication costs in terms of batches. Our algorithms are computationally
efficient and maintain the same order-optimal regret guarantees of
$\mathcal{O}(\log T)$ for stochastic MABs, and $\mathcal{O}(\sqrt{T})$ for RL.
We complement our theoretical findings with experimental results.
DOI: 10.48550/arxiv.2306.08803
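The abstract above describes batched Thompson sampling in which posterior samples are drawn with Langevin MCMC and posteriors are refreshed only at batch boundaries, so communication grows logarithmically in the horizon. The sketch below is only an illustration of that general idea for a Gaussian-reward multi-armed bandit, not the paper's algorithm: the step size, batch-doubling schedule, and all variable names are assumptions made for this example.

```python
# Minimal sketch (assumed, not the paper's exact method): batched Thompson sampling
# for a Gaussian-reward bandit, where each arm's posterior mean is sampled with a few
# unadjusted Langevin steps instead of a closed-form conjugate update.
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 10_000                       # number of arms and horizon (illustrative)
true_means = rng.normal(0.0, 1.0, K)   # unknown arm means (simulated environment)
rewards = [[] for _ in range(K)]       # per-arm observed rewards

def langevin_sample(obs, n_steps=50, prior_var=1.0, noise_var=1.0):
    """Approximate posterior sample of an arm's mean via unadjusted Langevin dynamics.

    Target: N(0, prior_var) prior times a Gaussian likelihood with variance noise_var.
    grad log p(theta | obs) = -theta/prior_var + (sum(obs) - n*theta)/noise_var.
    """
    obs = np.asarray(obs, dtype=float)
    precision = 1.0 / prior_var + obs.size / noise_var
    eta = 1.0 / precision              # step size scaled to posterior precision
    theta = obs.mean() if obs.size else 0.0
    for _ in range(n_steps):
        grad = -theta / prior_var + (obs.sum() - obs.size * theta) / noise_var
        theta += 0.5 * eta * grad + np.sqrt(eta) * rng.normal()
    return theta

t, batch_len = 0, K                    # doubling batches -> O(log T) update rounds
while t < T:
    # Posterior sampling (and any communication) happens only once per batch;
    # within a batch the chosen arm is frozen, a simplification for brevity.
    samples = np.array([langevin_sample(rewards[a]) for a in range(K)])
    arm = int(np.argmax(samples))      # play the arm with the largest sampled mean
    for _ in range(min(batch_len, T - t)):
        rewards[arm].append(true_means[arm] + rng.normal())
        t += 1
    batch_len *= 2
```

The doubling batch length is one simple way to obtain logarithmically many update rounds; the paper's batching schedules and regret analysis are more refined than this toy loop.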