Oracle-Efficient Reinforcement Learning for Max Value Ensembles
creator | Hussing, Marcel; Kearns, Michael; Roth, Aaron; Sengupta, Sikata Bela; Sorrell, Jessica |
description | Reinforcement learning (RL) in large or infinite state spaces is notoriously
challenging, both theoretically (where worst-case sample and computational
complexities must scale with state space cardinality) and experimentally (where
function approximation and policy gradient techniques often scale poorly and
suffer from instability and high variance). One line of research attempting to
address these difficulties makes the natural assumption that we are given a
collection of heuristic base or $\textit{constituent}$ policies upon which we
would like to improve in a scalable manner. In this work we aim to compete with
the $\textit{max-following policy}$, which at each state follows the action of
whichever constituent policy has the highest value. The max-following policy is
always at least as good as the best constituent policy, and may be considerably
better. Our main result is an efficient algorithm that learns to compete with
the max-following policy, given only access to the constituent policies (but
not their value functions). In contrast to prior work in similar settings, our
theoretical results require only the minimal assumption of an ERM oracle for
value function approximation for the constituent policies (and not the global
optimal policy or the max-following policy itself) on samplable distributions.
We illustrate our algorithm's experimental effectiveness and behavior on
several robotic simulation testbeds. |
doi_str_mv | 10.48550/arxiv.2405.16739 |
format | Article |
creationdate | 2024-05-26 |
rights | http://creativecommons.org/licenses/by/4.0 |
identifier | DOI: 10.48550/arxiv.2405.16739 |
language | eng |
recordid | cdi_arxiv_primary_2405_16739 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Learning; Computer Science - Systems and Control |
title | Oracle-Efficient Reinforcement Learning for Max Value Ensembles |
url | https://arxiv.org/abs/2405.16739 |
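The max-following rule described in the abstract above has a direct procedural reading: at each state, act according to whichever constituent policy currently has the highest estimated value. The sketch below illustrates only that selection step. It is a minimal illustration under assumed names, not the paper's algorithm: identifiers such as `max_following_action` and `value_estimates` are invented here, and the paper's actual contribution, learning the per-policy value estimates via an ERM oracle on samplable distributions, is not reproduced.

```python
# Hypothetical sketch of the max-following selection rule from the abstract.
# All names here are illustrative; the paper's algorithm learns the per-policy
# value estimates (via an ERM oracle) rather than taking them as given.

from typing import Callable, List, Sequence

State = Sequence[float]   # placeholder state representation
Action = int              # placeholder action representation


def max_following_action(
    state: State,
    constituent_policies: List[Callable[[State], Action]],
    value_estimates: List[Callable[[State], float]],
) -> Action:
    """Follow the constituent policy whose estimated value at `state` is largest.

    `value_estimates[k](state)` estimates the value of running constituent
    policy k from `state`; with accurate enough estimates, acting this way is
    at least as good as committing to the single best constituent policy.
    """
    best_k = max(
        range(len(constituent_policies)),
        key=lambda k: value_estimates[k](state),
    )
    return constituent_policies[best_k](state)


if __name__ == "__main__":
    # Toy usage: two constant-action policies on a 1-D state.
    policies = [lambda s: 0, lambda s: 1]
    values = [lambda s: s[0], lambda s: 1.0 - s[0]]
    print(max_following_action([0.8], policies, values))  # follows policy 0
```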