Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation

A precondition for deploying a Reinforcement Learning agent to a real-world system is to provide guarantees on the learning process. While a learning algorithm will eventually converge to a good policy, there are no guarantees on the performance of the exploratory policies. We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy. We propose the first conservative, provably efficient, model-free algorithm for policy optimization in continuous finite-horizon problems. We leverage importance sampling techniques to counterfactually evaluate the conservative condition from the data generated by the algorithm itself. We derive a regret bound and show that, with high probability, the conservative constraint is never violated during learning. Finally, we leverage these insights to build a general schema for conservative exploration in deep RL via off-policy policy evaluation techniques. We show the effectiveness of our methods empirically.
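As a rough illustration of the mechanism the abstract describes, the sketch below uses per-trajectory importance sampling to estimate a candidate policy's value from trajectories collected by the behavior policy, and then checks a conservative condition against a baseline value. All names here (is_value_estimate, conservative_condition_holds, confidence_width) are hypothetical and not taken from the paper; in practice the confidence width would come from a concentration bound such as the one the authors derive.

# Illustrative sketch (not the paper's code): per-trajectory importance-sampling
# off-policy evaluation and a conservative deployment check against a baseline.
from typing import Callable, List, Sequence, Tuple

State = int      # placeholder state type for a finite-horizon MDP
Action = int     # placeholder action type
Step = Tuple[State, Action, float]  # (state, action, reward)

def is_value_estimate(
    trajectories: Sequence[Sequence[Step]],
    target_prob: Callable[[State, Action], float],
    behavior_prob: Callable[[State, Action], float],
) -> float:
    """Average per-trajectory importance-sampling estimate of the target
    policy's return, computed from trajectories generated by the behavior
    policy (the data the algorithm collected itself)."""
    estimates: List[float] = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for state, action, reward in traj:
            # Reweight the trajectory by the likelihood ratio of choosing this action.
            weight *= target_prob(state, action) / max(behavior_prob(state, action), 1e-12)
            ret += reward
        estimates.append(weight * ret)
    return sum(estimates) / max(len(estimates), 1)

def conservative_condition_holds(
    trajectories: Sequence[Sequence[Step]],
    target_prob: Callable[[State, Action], float],
    behavior_prob: Callable[[State, Action], float],
    baseline_value: float,
    confidence_width: float,
) -> bool:
    """Only deploy the candidate policy if a high-probability lower bound on
    its estimated value still matches or exceeds the baseline policy's value.
    confidence_width stands in for the concentration term a full analysis
    would supply."""
    lower_bound = is_value_estimate(trajectories, target_prob, behavior_prob) - confidence_width
    return lower_bound >= baseline_value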

Bibliographic Details

Published in: arXiv.org, 2023-12
Date: 2023-12-24
Main authors: Daoudi, Paul; moso, Mathias; Othman Gaizi; Azize, Achraf; Evrard Garcelon
Format: Article
Language: English
Subjects: Algorithms; Importance sampling; Machine learning; Optimization
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Online access: Full text