A Convergent Off-Policy Temporal Difference Algorithm
Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem.
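The abstract's central idea is to penalize (regularize) the off-policy TD update so that the iterates converge under linear function approximation. The sketch below is only an illustration of that general idea, not the authors' algorithm: it combines standard off-policy TD(0) with importance-sampling ratios and an assumed L2-style shrinkage penalty on the weights. The function name, the penalty form, and all parameters are hypothetical, introduced here purely for illustration.

```python
# Minimal sketch: off-policy TD(0) with linear function approximation.
# The `reg` term is an assumed L2-style penalty illustrating the idea of
# penalizing updates to keep the iterates bounded; it is NOT the paper's method.
import numpy as np

def off_policy_td0_penalized(
    transitions,   # iterable of (phi_s, action, reward, phi_s_next) tuples
    pi, mu,        # target / behavior policies: (action, features) -> probability
    alpha=0.01,    # step size
    gamma=0.95,    # discount factor
    reg=1e-2,      # hypothetical penalty weight (assumption)
    dim=4,         # feature dimension (assumption)
):
    """Return a weight vector theta such that V(s) is approximated by theta @ phi_s."""
    theta = np.zeros(dim)
    for phi_s, a, r, phi_s_next in transitions:
        rho = pi(a, phi_s) / mu(a, phi_s)                        # importance-sampling ratio
        delta = r + gamma * theta @ phi_s_next - theta @ phi_s   # TD error
        # Standard off-policy TD(0) update plus a shrinkage penalty on theta;
        # the shrinkage stands in for the paper's convergence-ensuring term.
        theta += alpha * (rho * delta * phi_s - reg * theta)
    return theta
```

The shrinkage term is simply the easiest way to see how penalizing the update can keep the weight iterates from diverging; the paper itself derives a specific penalty together with a convergence analysis.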
Published in: | arXiv.org 2019-11 |
---|---|
Main authors: | Raghuram Bharadwaj Diddigi; Kamanchi, Chandramouli; Bhatnagar, Shalabh |
Format: | Article |
Language: | eng |
Subjects: | Algorithms; Approximation; Convergence; Linear functions; Machine learning; Mathematical analysis |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Raghuram Bharadwaj Diddigi; Kamanchi, Chandramouli; Bhatnagar, Shalabh |
description | Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation diverge. In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in such a way as to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2019-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2314444684 |
source | Free E-Journals |
subjects | Algorithms; Approximation; Convergence; Linear functions; Machine learning; Mathematical analysis |
title | A Convergent Off-Policy Temporal Difference Algorithm |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T04%3A18%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20Convergent%20Off-Policy%20Temporal%20Difference%20Algorithm&rft.jtitle=arXiv.org&rft.au=Raghuram%20Bharadwaj%20Diddigi&rft.date=2019-11-13&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2314444684%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2314444684&rft_id=info:pmid/&rfr_iscdi=true |