Beyond the Boundaries of Proximal Policy Optimization
Saved in:
Main authors: , , , ,
Format: Article
Language: eng
Online access: Order full text
Summary: Proximal policy optimization (PPO) is a widely used algorithm for on-policy reinforcement learning. This work offers an alternative perspective on PPO, in which it is decomposed into the inner-loop estimation of update vectors and the outer-loop application of those updates using gradient ascent with a unity learning rate. Using this insight we propose outer proximal policy optimization (outer-PPO), a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular, we consider non-unity learning rates and momentum applied to the outer loop, and a momentum bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvements on Brax and Jumanji, given the same hyperparameter tuning budget.
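The decomposition described in the summary lends itself to a short sketch. The following Python fragment is a minimal illustration under stated assumptions, not the authors' implementation: `run_inner_ppo_epochs`, `collect_rollouts` and the specific `outer_lr` / `outer_momentum` values are hypothetical placeholders. It shows the inner loop producing an update vector (theta_inner - theta) and the outer loop applying it with a chosen gradient-based optimizer (here, SGD with momentum), so that a unity learning rate and zero momentum recover standard PPO.

```python
# Minimal sketch of the outer-PPO decomposition described above.
# All names and hyperparameter values are illustrative assumptions,
# not the authors' code or tuned settings.
import numpy as np

def run_inner_ppo_epochs(theta, rollouts):
    """Placeholder for the standard PPO inner loop (several epochs of
    minibatch optimization of the clipped surrogate on `rollouts`).
    Returns the parameters reached at the end of the inner loop."""
    # Dummy perturbation so the sketch runs end to end.
    return theta + 0.01 * np.random.randn(*theta.shape)

def outer_ppo(theta, collect_rollouts, num_iters=100,
              outer_lr=1.5, outer_momentum=0.9):
    """Outer loop: treat (theta_inner - theta) as an update vector and
    apply it with a gradient-based optimizer (here SGD with momentum)
    rather than the implicit unity-learning-rate step of vanilla PPO."""
    velocity = np.zeros_like(theta)
    for _ in range(num_iters):
        rollouts = collect_rollouts(theta)
        theta_inner = run_inner_ppo_epochs(theta, rollouts)
        update = theta_inner - theta            # inner-loop update vector
        velocity = outer_momentum * velocity + update
        theta = theta + outer_lr * velocity     # outer_lr=1, momentum=0
                                                # recovers standard PPO
    return theta

if __name__ == "__main__":
    theta0 = np.zeros(8)
    theta_final = outer_ppo(theta0, collect_rollouts=lambda th: None)
    print(theta_final)
```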
DOI: 10.48550/arxiv.2411.00666