Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

A precondition for deploying a Reinforcement Learning agent to a real-world system is to provide guarantees on the learning process. While a learning algorithm will eventually converge to a good policy, there are no guarantees on the performance of the exploratory policies. We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy. We propose the first conservative, provably efficient, model-free algorithm for policy optimization in continuous finite-horizon problems. We leverage importance sampling techniques to counterfactually evaluate the conservative condition from the data generated by the algorithm itself. We derive a regret bound and show that, with high probability, the conservative constraint is never violated during learning. Finally, we leverage these insights to build a general schema for conservative exploration in deep RL via off-policy policy evaluation techniques. We show the effectiveness of our methods empirically.
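As a rough illustration of the mechanism the abstract describes, the sketch below uses per-trajectory importance sampling to estimate a candidate policy's value from trajectories collected by the behavior policy, and then checks a conservative condition against a baseline value. All names here (is_value_estimate, conservative_condition_holds, confidence_width) are hypothetical and not taken from the paper; in practice the confidence width would come from a concentration bound such as the one the authors derive.

# Illustrative sketch (not the paper's code): per-trajectory importance-sampling
# off-policy evaluation and a conservative deployment check against a baseline.
from typing import Callable, List, Sequence, Tuple

State = int      # placeholder state type for a finite-horizon MDP
Action = int     # placeholder action type
Step = Tuple[State, Action, float]  # (state, action, reward)

def is_value_estimate(
    trajectories: Sequence[Sequence[Step]],
    target_prob: Callable[[State, Action], float],
    behavior_prob: Callable[[State, Action], float],
) -> float:
    """Average per-trajectory importance-sampling estimate of the target
    policy's return, computed from trajectories generated by the behavior
    policy (the data the algorithm collected itself)."""
    estimates: List[float] = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for state, action, reward in traj:
            # Reweight the trajectory by the likelihood ratio of choosing this action.
            weight *= target_prob(state, action) / max(behavior_prob(state, action), 1e-12)
            ret += reward
        estimates.append(weight * ret)
    return sum(estimates) / max(len(estimates), 1)

def conservative_condition_holds(
    trajectories: Sequence[Sequence[Step]],
    target_prob: Callable[[State, Action], float],
    behavior_prob: Callable[[State, Action], float],
    baseline_value: float,
    confidence_width: float,
) -> bool:
    """Only deploy the candidate policy if a high-probability lower bound on
    its estimated value still matches or exceeds the baseline policy's value.
    confidence_width stands in for the concentration term a full analysis
    would supply."""
    lower_bound = is_value_estimate(trajectories, target_prob, behavior_prob) - confidence_width
    return lower_bound >= baseline_value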

Bibliographic Details

Published in: arXiv.org, 2023-12
Date: 2023-12-24
Main authors: Daoudi, Paul; moso, Mathias; Othman Gaizi; Azize, Achraf; Evrard Garcelon
Format: Article
Language: English
Subjects: Algorithms; Importance sampling; Machine learning; Optimization
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Online access: Full text