Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
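The core idea described in the abstract is off-policy Monte Carlo evaluation: episodes collected under a behavior policy can still yield an unbiased estimate of the target policy's value when returns are reweighted by importance ratios, and a well-chosen behavior policy lowers the estimator's variance. The sketch below only illustrates the generic ordinary importance sampling estimator; the tabular policy representation and function names are illustrative assumptions, not the paper's closed-form behavior policy.

import numpy as np

def is_return(trajectory, target_policy, behavior_policy, gamma=1.0):
    # trajectory: list of (state, action, reward) tuples collected under behavior_policy
    # target_policy / behavior_policy: dict mapping state -> {action: probability}
    rho = 1.0        # cumulative importance ratio  prod_t pi(a_t|s_t) / mu(a_t|s_t)
    ret = 0.0        # discounted return of the episode
    discount = 1.0
    for state, action, reward in trajectory:
        rho *= target_policy[state][action] / behavior_policy[state][action]
        ret += discount * reward
        discount *= gamma
    # Reweighting the return by rho keeps the estimate unbiased for the target
    # policy; a well-chosen behavior policy makes rho * ret less spread out,
    # which is the variance reduction the paper targets.
    return rho * ret

def evaluate(episodes, target_policy, behavior_policy, gamma=1.0):
    # Unbiased Monte Carlo estimate of the target policy's expected return
    # from episodes generated by the behavior policy.
    return float(np.mean([is_return(ep, target_policy, behavior_policy, gamma)
                          for ep in episodes]))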

Full description

Saved in:
Bibliographic Details
Main Authors: Liu, Shuze; Zhang, Shangtong
Format: Article
Language: English
Subjects: Computer Science - Learning
Online Access: Order full text
creator Liu, Shuze ; Zhang, Shangtong
description Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
doi_str_mv 10.48550/arxiv.2301.13734
format Article
identifier DOI: 10.48550/arxiv.2301.13734
language eng
recordid cdi_arxiv_primary_2301_13734
source arXiv.org
subjects Computer Science - Learning
title Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
url https://arxiv.org/abs/2301.13734