Scalable Exploration via Ensemble++
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | Scalable exploration in high-dimensional, complex environments is a
significant challenge in sequential decision making, especially when utilizing
neural networks. Ensemble sampling, a practical approximation of Thompson
sampling, is widely adopted but often suffers from performance degradation due to
ensemble coupling in shared-layer architectures, leading to reduced diversity
and ineffective exploration. In this paper, we introduce Ensemble++, a novel
method that addresses these challenges through architectural and algorithmic
innovations. To prevent ensemble coupling, Ensemble++ decouples mean and
uncertainty estimation by separating the base network and ensemble components,
and employs a symmetrized loss function together with the stop-gradient operator. To further
enhance exploration, it generates richer hypothesis spaces through random
linear combinations of ensemble components using continuous index sampling.
Theoretically, we prove that Ensemble++ matches the regret bounds of exact
Thompson sampling in linear contextual bandits while maintaining a scalable
per-step computational complexity of $\tilde{O}(\log T)$. This provides the
first rigorous analysis demonstrating that ensemble sampling can be a scalable
and effective approximation to Thompson sampling, closing a key theoretical gap
in exploration efficiency. Empirically, we demonstrate Ensemble++'s
effectiveness in both regret minimization and computational efficiency across a
range of nonlinear bandit environments, including language-based contextual
bandits where the agents employ GPT backbones. Our results highlight the
capability of Ensemble++ for real-time adaptation in complex environments where
computational and data collection budgets are constrained.
Code: https://github.com/szrlee/Ensemble_Plus_Plus |
---|---|
DOI: | 10.48550/arxiv.2407.13195 |
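
The summary describes action selection through random linear combinations of ensemble components with a continuously sampled index. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes linear value heads, a standard Gaussian index distribution, and hypothetical helper names (`sample_index`, `sampled_value`, `select_action`). Training with the symmetrized loss and the stop-gradient operator is not shown.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def sample_index(num_components, rng):
    """Draw a continuous index z with i.i.d. N(0, 1) entries (an assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(num_components)]


def sampled_value(x, base_weights, ensemble_weights, z):
    """Mean estimate plus a z-weighted linear combination of ensemble components.

    base_weights: weights of the (assumed linear) mean head.
    ensemble_weights: one weight vector per ensemble component.
    """
    mean = dot(base_weights, x)
    perturbation = sum(z_m * dot(w_m, x) for z_m, w_m in zip(z, ensemble_weights))
    return mean + perturbation


def select_action(action_features, base_weights, ensemble_weights, rng):
    """Sample one index per decision step, then act greedily under it."""
    z = sample_index(len(ensemble_weights), rng)
    scores = [sampled_value(x, base_weights, ensemble_weights, z)
              for x in action_features]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    actions = [[1.0, 0.0], [0.0, 1.0]]      # one feature vector per action
    base = [0.2, 0.9]                        # hypothetical mean-head weights
    ensemble = [[0.3, 0.0], [0.0, 0.3]]      # hypothetical ensemble components
    print(select_action(actions, base, ensemble, rng))
```

Because the index is drawn from a continuous distribution rather than picked from a finite set of ensemble members, each step can realize a different hypothesis from the span of the components. When the ensemble weights are all zero, the rule reduces to greedy selection on the mean estimate.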