SPO: Sequential Monte Carlo Policy Optimisation
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Leveraging planning during learning and decision-making is central to the long-term development of intelligent agents. Recent works have successfully combined tree-based search methods and self-play learning mechanisms to this end. However, these methods typically face scaling challenges due to the sequential nature of their search. While practical engineering solutions can partly overcome this, they often result in a negative impact on performance. In this paper, we introduce SPO: Sequential Monte Carlo Policy Optimisation, a model-based reinforcement learning algorithm grounded within the Expectation Maximisation (EM) framework. We show that SPO provides robust policy improvement and efficient scaling properties. The sample-based search makes it directly applicable to both discrete and continuous action spaces without modifications. We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. Furthermore, the parallel nature of SPO's search enables effective utilisation of hardware accelerators, yielding favourable scaling laws.
DOI: 10.48550/arxiv.2402.07963
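
To make the abstract's sample-based search concrete, below is a minimal, hypothetical sketch of sequential-Monte-Carlo-style action selection with a learned model: a population of particles is rolled out in parallel under the current policy, reweighted by tempered rewards, and resampled. The `model` and `policy` interfaces, the hyperparameters, and the multinomial resampling scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def smc_action_selection(model, policy, state, n_particles=256,
                         horizon=16, temperature=1.0, seed=0):
    """Hypothetical SMC planner sketch (not the paper's actual API).

    model(states, actions) -> (next_states, rewards): learned dynamics
        model, vectorised over a batch of particles.
    policy(states) -> actions: proposal distribution; here the current policy.
    """
    rng = np.random.default_rng(seed)
    # Replicate the root state across all particles (state is assumed 1-D).
    states = np.repeat(state[None, :], n_particles, axis=0)
    first_actions = None
    log_weights = np.zeros(n_particles)

    for t in range(horizon):
        actions = policy(states)                  # sample proposals from the current policy
        states, rewards = model(states, actions)  # advance every particle through the model
        log_weights += rewards / temperature      # reweight by tempered reward

        if t == 0:
            first_actions = actions               # remember each particle's root action

        # Multinomial resampling: clone high-weight particles, drop low-weight ones.
        probs = np.exp(log_weights - log_weights.max())
        probs /= probs.sum()
        idx = rng.choice(n_particles, size=n_particles, p=probs)
        states, first_actions = states[idx], first_actions[idx]
        log_weights = np.zeros(n_particles)       # weights are uniform after resampling

    # The surviving root actions form an empirical sample from an improved
    # (posterior) policy; an EM-style M-step would fit the policy network to it.
    return first_actions
```

Because every particle is advanced by the same vectorised operations rather than by a sequential tree traversal, the inner loop maps naturally onto hardware accelerators, which is the property behind the favourable scaling the abstract reports.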