SPO: Sequential Monte Carlo Policy Optimisation

Leveraging planning during learning and decision-making is central to the long-term development of intelligent agents. Recent works have successfully combined tree-based search methods and self-play learning mechanisms to this end. However, these methods typically face scaling challenges due to the sequential nature of their search. While practical engineering solutions can partly overcome this, they often result in a negative impact on performance. In this paper, we introduce SPO: Sequential Monte Carlo Policy Optimisation, a model-based reinforcement learning algorithm grounded within the Expectation Maximisation (EM) framework. We show that SPO provides robust policy improvement and efficient scaling properties. The sample-based search makes it directly applicable to both discrete and continuous action spaces without modifications. We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. Furthermore, the parallel nature of SPO's search enables effective utilisation of hardware accelerators, yielding favourable scaling laws.
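The abstract describes the method only at a high level. As a rough illustration of what a sample-based (sequential Monte Carlo) action search of this kind can look like, the sketch below samples action sequences in parallel, weights them by return, and resamples. It is a minimal, generic toy example, not the authors' SPO implementation: the one-dimensional dynamics, particle count, horizon, temperature, and Gaussian proposal are all assumptions made purely for illustration.

    # Illustrative sketch only: a generic sequential Monte Carlo (particle-based)
    # action search, NOT the authors' SPO algorithm. All specifics here
    # (toy dynamics, particle count, horizon, temperature) are assumptions.
    import numpy as np

    def toy_step(state, action):
        """Assumed 1-D toy dynamics: reward is higher the closer the next state is to the origin."""
        next_state = state + action
        reward = -np.abs(next_state)
        return next_state, reward

    def smc_action_search(state, n_particles=256, horizon=8, temperature=1.0, rng=None):
        """Sample action sequences in parallel, weight them by reward, and resample.

        Returns a point estimate of the first action to execute, taken as the mean
        of the first actions of the surviving particles. Because the search is
        purely sample-based, it applies to continuous actions without modification.
        """
        rng = np.random.default_rng() if rng is None else rng
        states = np.full(n_particles, state, dtype=np.float64)
        log_weights = np.zeros(n_particles)
        first_actions = None

        for t in range(horizon):
            # Proposal distribution: a simple Gaussian prior over actions
            # (an assumption; a learned policy would normally play this role).
            actions = rng.normal(loc=0.0, scale=1.0, size=n_particles)
            if t == 0:
                first_actions = actions
            states, rewards = toy_step(states, actions)
            log_weights += rewards / temperature

            # Resample particles in proportion to their weights (the SMC step);
            # this concentrates computation on promising action sequences.
            probs = np.exp(log_weights - log_weights.max())
            probs /= probs.sum()
            idx = rng.choice(n_particles, size=n_particles, p=probs)
            states, first_actions = states[idx], first_actions[idx]
            log_weights = np.zeros(n_particles)  # weights are uniform after resampling

        return first_actions.mean()

    if __name__ == "__main__":
        print(smc_action_search(state=3.0))

All particles advance independently at each step, which is why this style of search parallelises naturally on hardware accelerators, in contrast to the inherently sequential expansion of a search tree.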

Full description

Saved in:
Bibliographic Details
Main Authors: Macfarlane, Matthew V, Toledo, Edan, Byrne, Donal, Duckworth, Paul, Laterre, Alexandre
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Learning
Online Access: Request full text
creator Macfarlane, Matthew V
Toledo, Edan
Byrne, Donal
Duckworth, Paul
Laterre, Alexandre
description Leveraging planning during learning and decision-making is central to the long-term development of intelligent agents. Recent works have successfully combined tree-based search methods and self-play learning mechanisms to this end. However, these methods typically face scaling challenges due to the sequential nature of their search. While practical engineering solutions can partly overcome this, they often result in a negative impact on performance. In this paper, we introduce SPO: Sequential Monte Carlo Policy Optimisation, a model-based reinforcement learning algorithm grounded within the Expectation Maximisation (EM) framework. We show that SPO provides robust policy improvement and efficient scaling properties. The sample-based search makes it directly applicable to both discrete and continuous action spaces without modifications. We demonstrate statistically significant improvements in performance relative to model-free and model-based baselines across both continuous and discrete environments. Furthermore, the parallel nature of SPO's search enables effective utilisation of hardware accelerators, yielding favourable scaling laws.
doi_str_mv 10.48550/arxiv.2402.07963
format Article
creationdate 2024-02-12
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
identifier DOI: 10.48550/arxiv.2402.07963
language eng
recordid cdi_arxiv_primary_2402_07963
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Learning
title SPO: Sequential Monte Carlo Policy Optimisation
url https://arxiv.org/abs/2402.07963