Oracle-Efficient Reinforcement Learning for Max Value Ensembles

Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of heuristic base or $\textit{constituent}$ policies upon which we would like to improve in a scalable manner. In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.
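
As a rough illustration of the max-following rule described in the abstract, here is a minimal Python sketch. It is not the paper's implementation; the names `max_following_action`, `constituent_policies`, and `value_estimates` are hypothetical, and the per-policy value estimates are assumed to already exist (the paper's contribution is learning to compete with this rule using only an ERM oracle).

```python
# Minimal sketch of the max-following selection rule (illustrative only).
# Assumes each constituent policy maps a state to an action, and that we have
# *estimates* of each constituent policy's value function; in the paper these
# estimates must be learned, not given.
import numpy as np

def max_following_action(state, constituent_policies, value_estimates):
    """Return the action of the constituent policy with the highest estimated value at `state`.

    constituent_policies: list of callables, state -> action
    value_estimates:      list of callables, state -> float (one per constituent policy)
    """
    values = [v(state) for v in value_estimates]
    best = int(np.argmax(values))             # index of the most promising constituent policy
    return constituent_policies[best](state)
```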

Bibliographic Details
Main Authors: Hussing, Marcel; Kearns, Michael; Roth, Aaron; Sengupta, Sikata Bela; Sorrell, Jessica
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Learning; Computer Science - Systems and Control
Online Access: https://arxiv.org/abs/2405.16739
creator Hussing, Marcel ; Kearns, Michael ; Roth, Aaron ; Sengupta, Sikata Bela ; Sorrell, Jessica
description Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of heuristic base or $\textit{constituent}$ policies upon which we would like to improve in a scalable manner. In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.
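
The ERM-oracle assumption in the description can be made concrete with a second hedged sketch, again not the authors' algorithm: it treats the oracle as an off-the-shelf regressor (ridge regression over states) fit to Monte Carlo returns of one constituent policy, with start states drawn from a samplable distribution. The environment interface (`env.set_state`, `env.step`), `sample_start_state`, `horizon`, and the linear function class are all assumptions made for illustration.

```python
# Hedged sketch of value-function estimation for a single constituent policy
# via an ERM (regression) oracle. Everything here is illustrative: the
# environment interface and the linear function class are assumptions.
import numpy as np
from sklearn.linear_model import Ridge

def rollout_return(env, policy, state, horizon, gamma=0.99):
    """One Monte Carlo estimate of the discounted return of `policy` from `state`."""
    env.set_state(state)                      # assumes the simulator can be reset to an arbitrary state
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state, reward, done = env.step(policy(state))
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def fit_value_estimate(env, policy, sample_start_state, n_samples=1000, horizon=200):
    """ERM step: regress Monte Carlo returns on states drawn from a samplable distribution."""
    states = np.array([sample_start_state() for _ in range(n_samples)])
    returns = np.array([rollout_return(env, policy, s, horizon) for s in states])
    model = Ridge(alpha=1.0).fit(states, returns)   # empirical risk minimization over a linear class
    return lambda s: float(model.predict(np.atleast_2d(s))[0])
```

Value estimates produced this way are the kind of input the max-following sketch above consumes.
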
doi_str_mv 10.48550/arxiv.2405.16739
format Article
identifier DOI: 10.48550/arxiv.2405.16739
language eng
recordid cdi_arxiv_primary_2405_16739
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Learning
Computer Science - Systems and Control
title Oracle-Efficient Reinforcement Learning for Max Value Ensembles
url https://arxiv.org/abs/2405.16739