Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
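The core idea described in the abstract is off-policy Monte Carlo evaluation: episodes collected under a behavior policy can still yield an unbiased estimate of the target policy's value when returns are reweighted by importance ratios, and a well-chosen behavior policy lowers the estimator's variance. The sketch below only illustrates the generic ordinary importance sampling estimator; the tabular policy representation and function names are illustrative assumptions, not the paper's closed-form behavior policy.

import numpy as np

def is_return(trajectory, target_policy, behavior_policy, gamma=1.0):
    # trajectory: list of (state, action, reward) tuples collected under behavior_policy
    # target_policy / behavior_policy: dict mapping state -> {action: probability}
    rho = 1.0        # cumulative importance ratio  prod_t pi(a_t|s_t) / mu(a_t|s_t)
    ret = 0.0        # discounted return of the episode
    discount = 1.0
    for state, action, reward in trajectory:
        rho *= target_policy[state][action] / behavior_policy[state][action]
        ret += discount * reward
        discount *= gamma
    # Reweighting the return by rho keeps the estimate unbiased for the target
    # policy; a well-chosen behavior policy makes rho * ret less spread out,
    # which is the variance reduction the paper targets.
    return rho * ret

def evaluate(episodes, target_policy, behavior_policy, gamma=1.0):
    # Unbiased Monte Carlo estimate of the target policy's expected return
    # from episodes generated by the behavior policy.
    return float(np.mean([is_return(ep, target_policy, behavior_policy, gamma)
                          for ep in episodes]))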

Full description

Saved in:
Bibliographic Details
Main Authors: Liu, Shuze; Zhang, Shangtong
Format: Article
Language: English
Subjects: Computer Science - Learning
Online Access: Order full text
creator Liu, Shuze ; Zhang, Shangtong
description Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
doi_str_mv 10.48550/arxiv.2301.13734
format Article
identifier DOI: 10.48550/arxiv.2301.13734
language eng
recordid cdi_arxiv_primary_2301_13734
source arXiv.org
subjects Computer Science - Learning
title Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
url https://arxiv.org/abs/2301.13734