Oracle-Efficient Reinforcement Learning for Max Value Ensembles
creator | Hussing, Marcel; Kearns, Michael; Roth, Aaron; Sengupta, Sikata Bela; Sorrell, Jessica |
description | Reinforcement learning (RL) in large or infinite state spaces is notoriously
challenging, both theoretically (where worst-case sample and computational
complexities must scale with state space cardinality) and experimentally (where
function approximation and policy gradient techniques often scale poorly and
suffer from instability and high variance). One line of research attempting to
address these difficulties makes the natural assumption that we are given a
collection of heuristic base or $\textit{constituent}$ policies upon which we
would like to improve in a scalable manner. In this work we aim to compete with
the $\textit{max-following policy}$, which at each state follows the action of
whichever constituent policy has the highest value. The max-following policy is
always at least as good as the best constituent policy, and may be considerably
better. Our main result is an efficient algorithm that learns to compete with
the max-following policy, given only access to the constituent policies (but
not their value functions). In contrast to prior work in similar settings, our
theoretical results require only the minimal assumption of an ERM oracle for
value function approximation for the constituent policies (and not the global
optimal policy or the max-following policy itself) on samplable distributions.
We illustrate our algorithm's experimental effectiveness and behavior on
several robotic simulation testbeds. |
doi_str_mv | 10.48550/arxiv.2405.16739 |
format | Article |
creationdate | 2024-05-26 |
rights | http://creativecommons.org/licenses/by/4.0 |
identifier | DOI: 10.48550/arxiv.2405.16739 |
language | eng |
recordid | cdi_arxiv_primary_2405_16739 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Learning; Computer Science - Systems and Control |
title | Oracle-Efficient Reinforcement Learning for Max Value Ensembles |
url | https://arxiv.org/abs/2405.16739 |
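The max-following rule described in the abstract above has a direct procedural reading: at each state, act according to whichever constituent policy currently has the highest estimated value. The sketch below illustrates only that selection step. It is a minimal illustration under assumed names, not the paper's algorithm: identifiers such as `max_following_action` and `value_estimates` are invented here, and the paper's actual contribution, learning the per-policy value estimates via an ERM oracle on samplable distributions, is not reproduced.

```python
# Hypothetical sketch of the max-following selection rule from the abstract.
# All names here are illustrative; the paper's algorithm learns the per-policy
# value estimates (via an ERM oracle) rather than taking them as given.

from typing import Callable, List, Sequence

State = Sequence[float]   # placeholder state representation
Action = int              # placeholder action representation


def max_following_action(
    state: State,
    constituent_policies: List[Callable[[State], Action]],
    value_estimates: List[Callable[[State], float]],
) -> Action:
    """Follow the constituent policy whose estimated value at `state` is largest.

    `value_estimates[k](state)` estimates the value of running constituent
    policy k from `state`; with accurate enough estimates, acting this way is
    at least as good as committing to the single best constituent policy.
    """
    best_k = max(
        range(len(constituent_policies)),
        key=lambda k: value_estimates[k](state),
    )
    return constituent_policies[best_k](state)


if __name__ == "__main__":
    # Toy usage: two constant-action policies on a 1-D state.
    policies = [lambda s: 0, lambda s: 1]
    values = [lambda s: s[0], lambda s: 1.0 - s[0]]
    print(max_following_action([0.8], policies, values))  # follows policy 0
```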