Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation
A precondition for the deployment of a Reinforcement Learning agent to a real-world system is to provide guarantees on the learning process. While a learning algorithm will eventually converge to a good policy, there are no guarantees on the performance of the exploratory policies. We study the problem of conservative exploration, where the learner must be able to guarantee that its performance is at least as good as a baseline policy. We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems. We leverage importance sampling techniques to counterfactually evaluate the conservative condition from the data self-generated by the algorithm. We derive a regret bound and show that (w.h.p.) the conservative constraint is never violated during learning. Finally, we leverage these insights to build a general schema for conservative exploration in DeepRL via off-policy policy evaluation techniques. We show empirically the effectiveness of our methods.
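The abstract's central mechanism, using importance sampling to counterfactually estimate a candidate policy's return from trajectories generated by other policies and then checking it against a baseline, can be made concrete with a minimal sketch. The code below is not the paper's algorithm; the names `is_estimate`, `conservative_check`, `baseline_value`, and `margin` are illustrative assumptions showing a generic per-trajectory importance-sampling estimator and a simple baseline test.

```python
import numpy as np

def is_estimate(trajectories, target_prob, behavior_prob, gamma=1.0):
    """Per-trajectory importance-sampling estimate of the target policy's return.

    Each trajectory is a list of (state, action, reward) tuples collected under
    the behavior policy. `target_prob(s, a)` and `behavior_prob(s, a)` return
    the probability of taking action `a` in state `s` under the target and
    behavior policies, respectively.
    """
    returns = []
    for traj in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in traj:
            # Cumulative likelihood ratio of the trajectory under the two policies.
            weight *= target_prob(state, action) / behavior_prob(state, action)
            ret += discount * reward
            discount *= gamma
        returns.append(weight * ret)
    return float(np.mean(returns))

def conservative_check(trajectories, target_prob, behavior_prob,
                       baseline_value, margin=0.0):
    """Hypothetical conservative test: accept the candidate policy only if its
    estimated value is at least the baseline value minus a tolerated margin."""
    value = is_estimate(trajectories, target_prob, behavior_prob)
    return value >= baseline_value - margin
```

In the setting described by the abstract, such a check would additionally have to account for estimation error, for example through a high-probability confidence bound, so that the conservative constraint is not violated during learning.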
Published in: | arXiv.org, 2023-12 |
---|---|
Main authors: | Daoudi, Paul; Formoso, Mathias; Othman Gaizi; Azize, Achraf; Evrard Garcelon |
Format: | Article |
Language: | eng |
Subjects: | Algorithms; Importance sampling; Machine learning; Optimization |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Daoudi, Paul; Formoso, Mathias; Othman Gaizi; Azize, Achraf; Evrard Garcelon |
description | A precondition for the deployment of a Reinforcement Learning agent to a real-world system is to provide guarantees on the learning process. While a learning algorithm will eventually converge to a good policy, there are no guarantees on the performance of the exploratory policies. We study the problem of conservative exploration, where the learner must at least be able to guarantee its performance is at least as good as a baseline policy. We propose the first conservative provably efficient model-free algorithm for policy optimization in continuous finite-horizon problems. We leverage importance sampling techniques to counterfactually evaluate the conservative condition from the data self-generated by the algorithm. We derive a regret bound and show that (w.h.p.) the conservative constraint is never violated during learning. Finally, we leverage these insights to build a general schema for conservative exploration in DeepRL via off-policy policy evaluation techniques. We show empirically the effectiveness of our methods. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2906661697 |
source | Free E-Journals |
subjects | Algorithms; Importance sampling; Machine learning; Optimization |
title | Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T05%3A10%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Conservative%20Exploration%20for%20Policy%20Optimization%20via%20Off-Policy%20Policy%20Evaluation&rft.jtitle=arXiv.org&rft.au=Daoudi,%20Paul&rft.date=2023-12-24&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2906661697%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2906661697&rft_id=info:pmid/&rfr_iscdi=true |