Anti-Concentrated Confidence Bonuses for Scalable Exploration
Saved in:
Main authors: , , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: International Conference on Learning Representations 2022. Intrinsic rewards play a
central role in handling the exploration-exploitation trade-off when designing sequential
decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement
learning. The LinUCB algorithm, a centerpiece of the stochastic linear bandits literature,
prescribes an elliptical bonus which addresses the challenge of leveraging shared information in
large action spaces. This bonus scheme cannot be directly transferred to high-dimensional
exploration problems, however, due to the computational cost of maintaining the inverse covariance
matrix of action features. We introduce \emph{anti-concentrated confidence bounds} for efficiently
approximating the elliptical bonus, using an ensemble of regressors trained to predict random
noise from policy network-derived features. Using this approximation, we obtain stochastic linear
bandit algorithms with $\tilde O(d \sqrt{T})$ regret bounds for $\mathrm{poly}(d)$ fixed actions.
We develop a practical variant for deep reinforcement learning that is competitive with
contemporary intrinsic reward heuristics on Atari benchmarks.
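For reference, the elliptical bonus the abstract alludes to is the standard LinUCB quantity $\sqrt{x^\top A^{-1} x}$, where $A$ is the regularized covariance of observed action features. A minimal sketch, with illustrative dimension and regularization values not taken from the paper:

```python
import numpy as np

class EllipticalBonus:
    """LinUCB-style elliptical bonus sqrt(x^T A^{-1} x).

    Maintaining A^{-1} is the cost the abstract identifies as the obstacle
    to scaling this bonus to high-dimensional (deep RL) feature spaces.
    """

    def __init__(self, d: int, lam: float = 1.0):
        self.A = lam * np.eye(d)            # regularized feature covariance
        self.A_inv = np.linalg.inv(self.A)

    def bonus(self, x: np.ndarray) -> float:
        return float(np.sqrt(x @ self.A_inv @ x))

    def update(self, x: np.ndarray) -> None:
        # Sherman-Morrison rank-one update keeps A^{-1} exact in O(d^2) per
        # step, which still scales poorly when x comes from a large network.
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.A += np.outer(x, x)
```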
DOI: 10.48550/arxiv.2110.11202
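As a rough illustration of the ensemble idea described in the abstract, here is one plausible instantiation: a set of linear regressors trained on i.i.d. Gaussian noise targets over the observed features, whose prediction spread at a new feature stands in for the elliptical bonus. The ensemble size, the SGD training, and the use of the maximum absolute prediction as the bonus are assumptions for illustration, not details taken from this record or the paper.

```python
import numpy as np

class AntiConcentratedBonus:
    """Sketch: ensemble of regressors fit to random noise (assumed details).

    Each member predicts an i.i.d. Gaussian label from observed features;
    the spread of the members' predictions at a new feature x approximates
    the elliptical quantity without forming or inverting a covariance matrix.
    """

    def __init__(self, d: int, m: int = 32, lr: float = 0.1, seed: int = 0):
        self.W = np.zeros((m, d))          # one weight vector per member
        self.lr = lr
        self.rng = np.random.default_rng(seed)

    def bonus(self, x: np.ndarray) -> float:
        # Maximum absolute prediction over the ensemble (illustrative choice;
        # the standard deviation across members would also be reasonable).
        return float(np.max(np.abs(self.W @ x)))

    def update(self, x: np.ndarray) -> None:
        # Draw fresh noise targets and take one squared-loss SGD step per member.
        eps = self.rng.standard_normal(self.W.shape[0])
        preds = self.W @ x
        self.W -= self.lr * np.outer(preds - eps, x)
```

In a deep RL setting, x would be a feature vector produced by the policy network for the current state, and the resulting bonus would be added to the extrinsic reward as an intrinsic exploration signal.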