On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes
Saved in:
Main Authors: | |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Abstract: | We present the first finite time global convergence analysis of policy
gradient in the context of infinite horizon average reward Markov decision
processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite
state and action spaces. Our analysis shows that the policy gradient iterates
converge to the optimal policy at a sublinear rate of
$O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$
regret, where $T$ represents the number of iterations. Prior work on
performance bounds for discounted reward MDPs cannot be extended to average
reward MDPs because the bounds grow proportional to the fifth power of the
effective horizon. Thus, our primary contribution is in proving that the policy
gradient algorithm converges for average-reward MDPs and in obtaining
finite-time performance guarantees. In contrast to the existing discounted
reward performance bounds, our performance bounds have an explicit dependence
on constants that capture the complexity of the underlying MDP. Motivated by
this observation, we reexamine and improve the existing performance bounds for
discounted reward MDPs. We also present simulations to empirically evaluate the
performance of the average reward policy gradient algorithm. |
DOI: | 10.48550/arxiv.2403.06806 |
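
The abstract's claim that an $O\left(\frac{1}{T}\right)$ convergence rate translates to $O\left(\log(T)\right)$ regret follows from summing the per-iterate optimality gap: if the gap at iteration $t$ is of order $1/t$, the cumulative gap over $T$ iterations is bounded by the harmonic sum, which grows as $\log(T)$.

As a rough illustration of the setting only (not the paper's algorithm, analysis, or experiments), the sketch below runs tabular softmax policy gradient with exact gradients of the average reward on a toy two-state, two-action ergodic MDP. The transition tensor `P`, reward table `r`, step size, and iteration count are illustrative assumptions chosen for this example.

```python
# Minimal sketch, not the paper's implementation: softmax policy gradient with
# exact gradients of the average reward on a toy ergodic tabular MDP. The MDP,
# step size, and iteration count are illustrative assumptions.
import numpy as np

n_states, n_actions = 2, 2

# P[s, a, s']: transition probabilities; r[s, a]: rewards. Chosen so that every
# policy induces a single recurrent class (ergodicity), purely for illustration.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.3, 0.7], [0.6, 0.4]],
])
r = np.array([
    [1.0, 0.0],
    [0.5, 2.0],
])

def softmax_policy(theta):
    """Tabular softmax policy pi[s, a] from logits theta[s, a]."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def stationary_distribution(P_pi):
    """Stationary distribution of the Markov chain induced by the policy."""
    evals, evecs = np.linalg.eig(P_pi.T)
    d = np.real(evecs[:, np.argmax(np.real(evals))])
    return d / d.sum()

def gain_bias_distribution(pi):
    """Average reward (gain), differential value function, and stationary
    distribution of a policy, computed exactly from the model."""
    P_pi = np.einsum('sap,sa->sp', P, pi)      # transition matrix under pi
    r_pi = np.einsum('sa,sa->s', r, pi)        # expected one-step reward under pi
    d = stationary_distribution(P_pi)
    gain = float(d @ r_pi)
    # Differential value h solves (I - P_pi) h = r_pi - gain with d @ h = 0.
    A = np.vstack([np.eye(n_states) - P_pi, d])
    b = np.append(r_pi - gain, 0.0)
    h = np.linalg.lstsq(A, b, rcond=None)[0]
    return gain, h, d

# Exact policy gradient ascent on the average reward. For a tabular softmax
# policy the gradient is d(s) * pi(a|s) * (q(s, a) - v(s)), where q and v are
# the differential action- and state-value functions.
theta = np.zeros((n_states, n_actions))
step_size = 0.5
for t in range(200):
    pi = softmax_policy(theta)
    gain, h, d = gain_bias_distribution(pi)
    q = r - gain + np.einsum('sap,p->sa', P, h)       # differential action values
    adv = q - np.einsum('sa,sa->s', pi, q)[:, None]   # advantage under pi
    theta += step_size * d[:, None] * pi * adv
    if t % 50 == 0:
        print(f"iteration {t:3d}: average reward = {gain:.4f}")
```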