On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes

We present the first finite time global convergence analysis of policy gradient in the context of infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left({\frac{1}{T}}\right)$, which translates to $O\left({\log(T)}\right)$ regret, where $T$ represents the number of iterations. Prior work on performance bounds for discounted reward MDPs cannot be extended to average reward MDPs because the bounds grow proportionally to the fifth power of the effective horizon. Thus, our primary contribution is in proving that the policy gradient algorithm converges for average-reward MDPs and in obtaining finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations to empirically evaluate the performance of the average reward policy gradient algorithm.
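
To illustrate the setting, the sketch below runs tabular softmax policy gradient with the average-reward objective on a small randomly generated ergodic MDP. It is a minimal, model-based illustration only: the MDP, the constant step size, and the exact (non-sampled) gradients are assumptions for demonstration, not the paper's algorithm, step-size schedule, or constants.

```python
# Minimal sketch (assumptions noted above): softmax policy gradient for the
# average-reward objective J(pi) = sum_s d_pi(s) sum_a pi(a|s) r(s,a).
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 3                                  # hypothetical small ergodic MDP
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                 # rewards r(s, a)

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # pi[s, a]

def average_reward_quantities(pi):
    # Induced chain P_pi, its stationary distribution d, the average reward J,
    # and the differential (bias) value h solving h = r_pi - J + P_pi h, d @ h = 0.
    P_pi = np.einsum('sa,sax->sx', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    evals, evecs = np.linalg.eig(P_pi.T)
    d = np.real(evecs[:, np.argmax(np.real(evals))])
    d = d / d.sum()                          # stationary distribution
    J = d @ r_pi
    Amat = np.vstack([np.eye(S) - P_pi, d])  # Poisson equation + normalization
    b = np.concatenate([r_pi - J, [0.0]])
    h, *_ = np.linalg.lstsq(Amat, b, rcond=None)
    return d, J, h

def policy_gradient(theta):
    pi = softmax_policy(theta)
    d, J, h = average_reward_quantities(pi)
    # Differential action values Q(s, a) = r(s, a) - J + E[h(s') | s, a].
    Q = R - J + np.einsum('sax,x->sa', P, h)
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)
    # Average-reward policy gradient theorem (tabular softmax parameterization).
    return d[:, None] * pi * adv, J

theta = np.zeros((S, A))
for t in range(2000):
    grad, J = policy_gradient(theta)
    theta += 0.5 * grad                      # constant step size, for illustration
print("average reward of the learned policy:", J)
```

The printed average reward should approach the optimal average reward of the random MDP as the iterates converge.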

Bibliographic Details
Main Authors: Kumar, Navdeep, Murthy, Yashaswini, Shufaro, Itai, Levy, Kfir Y, Srikant, R, Mannor, Shie
Format: Article
Language: eng
Subjects: Computer Science - Learning; Computer Science - Systems and Control
creator Kumar, Navdeep
Murthy, Yashaswini
Shufaro, Itai
Levy, Kfir Y
Srikant, R
Mannor, Shie
description We present the first finite time global convergence analysis of policy gradient in the context of infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left({\frac{1}{T}}\right)$, which translates to $O\left({\log(T)}\right)$ regret, where $T$ represents the number of iterations. Prior work on performance bounds for discounted reward MDPs cannot be extended to average reward MDPs because the bounds grow proportionally to the fifth power of the effective horizon. Thus, our primary contribution is in proving that the policy gradient algorithm converges for average-reward MDPs and in obtaining finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations to empirically evaluate the performance of the average reward policy gradient algorithm.
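
The rate-to-regret translation stated in the abstract can be made explicit with a short calculation. Assuming the per-iterate optimality gap decays like $c/t$ for some MDP-dependent constant $c$ (a hedged reading of the stated $O\left({\frac{1}{T}}\right)$ rate; $c$ is not a constant reported in this record), summing over $T$ iterations gives

$$\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T}\bigl(J^{*} - J(\pi_t)\bigr) \;\le\; \sum_{t=1}^{T}\frac{c}{t} \;\le\; c\,(1+\log T) \;=\; O\left({\log(T)}\right),$$

where $J^{*}$ is the optimal average reward and $\pi_t$ is the $t$-th policy iterate.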
doi_str_mv 10.48550/arxiv.2403.06806
format Article
fullrecord arXiv:2403.06806 (open access); submitted 2024-03-11; full text: https://arxiv.org/abs/2403.06806; rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2403.06806
language eng
recordid cdi_arxiv_primary_2403_06806
source arXiv.org
subjects Computer Science - Learning
Computer Science - Systems and Control
title On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T00%3A48%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=On%20the%20Global%20Convergence%20of%20Policy%20Gradient%20in%20Average%20Reward%20Markov%20Decision%20Processes&rft.au=Kumar,%20Navdeep&rft.date=2024-03-11&rft_id=info:doi/10.48550/arxiv.2403.06806&rft_dat=%3Carxiv_GOX%3E2403_06806%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true