On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

We consider a totally asynchronous stochastic approximation algorithm, Q-learning, for solving finite space stochastic shortest path (SSP) problems, which are undiscounted, total cost Markov decision processes with an absorbing and cost-free state. For the most commonly used SSP models, existing convergence proofs assume that the sequence of Q-learning iterates is bounded with probability one, or some other condition that guarantees boundedness. We prove that the sequence of iterates is naturally bounded with probability one, thus furnishing the boundedness condition in the convergence proof by Tsitsiklis [Tsitsiklis JN (1994) Asynchronous stochastic approximation and Q-learning. Machine Learn. 16:185-202] and establishing completely the convergence of Q-learning for these SSP models.

Bibliographic Details

Published in: Mathematics of operations research, 2013-05, Vol. 38 (2), p. 209-227
Authors: Yu, Huizhen; Bertsekas, Dimitri P.
Format: Article
Language: English
Online access: Full text
DOI: 10.1287/moor.1120.0562
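The abstract above concerns tabular Q-learning for SSP problems: the undiscounted update Q(s,a) ← Q(s,a) + α[c + min_b Q(s',b) − Q(s,a)], with Q fixed at zero at the absorbing, cost-free state. As a minimal illustrative sketch (not taken from the paper), the following runs that iteration on a hypothetical 3-state SSP; the example MDP, its costs, and the diminishing step-size rule are all assumptions for illustration.

```python
import random

random.seed(0)

# Hypothetical SSP: state 0 is the absorbing, cost-free goal state.
# transitions[(state, action)] -> list of (probability, next_state, cost)
transitions = {
    (1, "a"): [(1.0, 0, 1.0)],                 # state 1: reach goal, cost 1
    (2, "a"): [(0.5, 1, 1.0), (0.5, 2, 1.0)],  # risky: may stay in state 2
    (2, "b"): [(1.0, 0, 2.5)],                 # direct to goal, cost 2.5
}
actions = {1: ["a"], 2: ["a", "b"]}

Q = {sa: 0.0 for sa in transitions}   # Q(0, .) = 0 is implicit
counts = {sa: 0 for sa in transitions}

def sample(state, action):
    """Draw a (next_state, cost) pair from the transition distribution."""
    r, acc = random.random(), 0.0
    for p, s2, c in transitions[(state, action)]:
        acc += p
        if r <= acc:
            return s2, c
    return s2, c  # guard against floating-point round-off

for episode in range(20000):
    s = random.choice([1, 2])
    while s != 0:
        a = random.choice(actions[s])  # uniform exploration (off-policy)
        s2, c = sample(s, a)
        counts[(s, a)] += 1
        alpha = 1.0 / counts[(s, a)]   # diminishing step size per pair
        # Undiscounted Q-learning target: cost plus min over next actions,
        # with the absorbing goal state contributing zero.
        future = 0.0 if s2 == 0 else min(Q[(s2, b)] for b in actions[s2])
        Q[(s, a)] += alpha * (c + future - Q[(s, a)])
        s = s2

print({sa: round(q, 2) for sa, q in Q.items()})
```

For this toy instance the fixed point is Q(1,a) = 1, Q(2,b) = 2.5, and Q(2,a) = 1 + 0.5·1 + 0.5·2.5 = 2.75, so the iterates stay bounded and the learned values should land near those numbers.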
Publisher: INFORMS (Linthicum)
CODEN: MOREDQ
ISSN: 0364-765X
EISSN: 1526-5471
Source: INFORMS PubsOnLine; Business Source Complete; JSTOR Mathematics & Statistics; JSTOR Archive Collection A-Z Listing
Subjects: Analysis; Dynamic programming; Learning; Markov analysis; Markov decision processes; Markov processes; Q-learning; reinforcement learning; Shortest path algorithms; stochastic approximation; Stochastic models; Stochastic processes; Studies