An Off-policy Policy Gradient Theorem Using Emphatic Weightings
Format: Article
Language: English
Online access: Order full text
Abstract: Policy gradient methods are widely used for control in reinforcement learning, particularly in the continuous-action setting. A host of theoretically sound algorithms have been proposed for the on-policy setting, owing to the policy gradient theorem, which provides a simplified form for the gradient. In off-policy learning, however, where the behaviour policy is not necessarily attempting to learn and follow the optimal policy for the given task, the existence of such a theorem has been elusive. In this work, we solve this open problem by providing the first off-policy policy gradient theorem. The key to the derivation is the use of *emphatic weightings*. We develop a new actor-critic algorithm, called Actor Critic with Emphatic weightings (ACE), that approximates the simplified gradients provided by the theorem. We demonstrate in a simple counterexample that previous off-policy policy gradient methods, in particular OffPAC and DPG, converge to the wrong solution, whereas ACE finds the optimal solution.
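To make the abstract's central claim concrete, the following is a hedged sketch of the form such a theorem takes for an excursion-style off-policy objective; the notation ($J_\mu$ for the objective, $d_\mu$ for the behaviour policy's state distribution, $i$ for an interest function, $m$ for the emphatic weighting) is reconstructed from the emphatic-weighting literature rather than quoted from the paper itself:

$$
J_\mu(\theta) = \sum_s d_\mu(s)\, i(s)\, v_\pi(s),
\qquad
\nabla_\theta J_\mu(\theta) = \sum_s m(s) \sum_a \nabla_\theta \pi(a \mid s; \theta)\, q_\pi(s, a),
$$

where the emphatic weighting satisfies $\mathbf{m}^\top = \mathbf{d}^\top (\mathbf{I} - \mathbf{P}_{\pi,\gamma})^{-1}$, with $d(s) = d_\mu(s)\, i(s)$ and $\mathbf{P}_{\pi,\gamma}(s, s') = \gamma \sum_a \pi(a \mid s)\, p(s' \mid s, a)$.

Likewise, below is a minimal Python sketch of how an incremental actor-critic update might scale its actor step by an online emphatic (followon) trace. Everything here is an illustrative assumption: the tabular environment, the uniform behaviour policy, the TD(0) critic, and the choice of full bootstrapping for the emphasis; it is not the paper's reference implementation of ACE.

```python
import numpy as np

# Illustrative tabular setup; sizes and dynamics are assumptions for this sketch.
n_states, n_actions = 5, 2
gamma, alpha_v, alpha_pi = 0.9, 0.1, 0.01
theta = np.zeros((n_states, n_actions))   # softmax policy parameters (the actor)
v = np.zeros(n_states)                    # tabular state-value critic
F = 0.0                                   # emphatic (followon) trace

def pi(s):
    """Softmax target policy over actions in state s."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def env_step(s, a):
    """Hypothetical dynamics: uniformly random next state, reward 1 in state 0."""
    s_next = np.random.randint(n_states)
    return (1.0 if s_next == 0 else 0.0), s_next

mu = np.full(n_actions, 1.0 / n_actions)  # fixed uniform behaviour policy
s, rho_prev = np.random.randint(n_states), 1.0

for t in range(10_000):
    a = np.random.choice(n_actions, p=mu)
    r, s_next = env_step(s, a)
    rho = pi(s)[a] / mu[a]                # importance-sampling ratio
    delta = r + gamma * v[s_next] - v[s]  # TD error from the critic

    # Emphatic trace: discounted, reweighted accumulation of interest (i(s) = 1 here).
    F = gamma * rho_prev * F + 1.0
    M = F                                 # emphasis applied to this state's update

    v[s] += alpha_v * rho * delta         # simple off-policy TD(0) critic update
    grad_log = -pi(s)
    grad_log[a] += 1.0                    # gradient of log softmax w.r.t. theta[s]
    theta[s] += alpha_pi * rho * M * delta * grad_log

    rho_prev, s = rho, s_next
```

The scaling of the actor update by $\rho_t M_t \delta_t$ is the piece that distinguishes this sketch from an OffPAC-style update, which would use $\rho_t \delta_t$ alone.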
DOI: 10.48550/arxiv.1811.09013