A Convergent Off-Policy Temporal Difference Algorithm
Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem.
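The abstract's central idea is to penalize (regularize) the off-policy TD update so that the iterates converge under linear function approximation. The sketch below is only an illustration of that general idea, not the authors' algorithm: it combines standard off-policy TD(0) with importance-sampling ratios and an assumed L2-style shrinkage penalty on the weights. The function name, the penalty form, and all parameters are hypothetical, introduced here purely for illustration.

```python
# Minimal sketch: off-policy TD(0) with linear function approximation.
# The `reg` term is an assumed L2-style penalty illustrating the idea of
# penalizing updates to keep the iterates bounded; it is NOT the paper's method.
import numpy as np

def off_policy_td0_penalized(
    transitions,   # iterable of (phi_s, action, reward, phi_s_next) tuples
    pi, mu,        # target / behavior policies: (action, features) -> probability
    alpha=0.01,    # step size
    gamma=0.95,    # discount factor
    reg=1e-2,      # hypothetical penalty weight (assumption)
    dim=4,         # feature dimension (assumption)
):
    """Return a weight vector theta such that V(s) is approximated by theta @ phi_s."""
    theta = np.zeros(dim)
    for phi_s, a, r, phi_s_next in transitions:
        rho = pi(a, phi_s) / mu(a, phi_s)                        # importance-sampling ratio
        delta = r + gamma * theta @ phi_s_next - theta @ phi_s   # TD error
        # Standard off-policy TD(0) update plus a shrinkage penalty on theta;
        # the shrinkage stands in for the paper's convergence-ensuring term.
        theta += alpha * (rho * delta * phi_s - reg * theta)
    return theta
```

The shrinkage term is simply the easiest way to see how penalizing the update can keep the weight iterates from diverging; the paper itself derives a specific penalty together with a convergence analysis.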
Published in: | arXiv.org 2019-11 |
---|---|
Main authors: | Raghuram Bharadwaj Diddigi; Kamanchi, Chandramouli; Bhatnagar, Shalabh |
Format: | Article |
Language: | eng |
Subjects: | Algorithms; Approximation; Convergence; Linear functions; Machine learning; Mathematical analysis |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Raghuram Bharadwaj Diddigi; Kamanchi, Chandramouli; Bhatnagar, Shalabh |
description | Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation diverge. In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in such a way as to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2019-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2314444684 |
source | Free E-Journals |
subjects | Algorithms; Approximation; Convergence; Linear functions; Machine learning; Mathematical analysis |
title | A Convergent Off-Policy Temporal Difference Algorithm |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T04%3A18%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20Convergent%20Off-Policy%20Temporal%20Difference%20Algorithm&rft.jtitle=arXiv.org&rft.au=Raghuram%20Bharadwaj%20Diddigi&rft.date=2019-11-13&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2314444684%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2314444684&rft_id=info:pmid/&rfr_iscdi=true |