Generalized Off-Policy Actor-Critic
Main authors:
Format: Article
Language: English
Subjects:
Online access: Request full text
Summary: We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance of the target policy when deployed, our new objective better predicts such performance. We prove the Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient of the counterfactual objective and use an emphatic approach to obtain an unbiased sample of this policy gradient, yielding the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over existing algorithms in MuJoCo robot simulation tasks, the first empirical success of emphatic algorithms in prevailing deep RL benchmarks.
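The record does not spell out the counterfactual objective itself. A minimal sketch of the interpolation idea in standard off-policy notation follows; the symbols here are assumptions, not definitions taken from the paper: $d_\mu$ is the stationary state distribution of the behavior policy $\mu$, $P_\pi$ the state-transition kernel under the target policy $\pi$, $r_\pi(s)$ the expected one-step reward under $\pi$ in state $s$, and $\hat{\gamma} \in [0, 1]$ an interpolation parameter.

```latex
% A hedged sketch, not the paper's exact definition: d_mu, P_pi, r_pi,
% and gamma-hat are assumed notation. The state weighting d_{gamma-hat}
% is the fixed point of a process that restarts from d_mu with
% probability 1 - gamma-hat and otherwise follows the target policy's
% dynamics; the objective weights expected reward by that distribution.
\begin{align}
  d_{\hat{\gamma}}(s) &= (1 - \hat{\gamma})\, d_{\mu}(s)
    + \hat{\gamma} \sum_{s'} d_{\hat{\gamma}}(s')\, P_{\pi}(s \mid s'), \\
  J_{\hat{\gamma}}(\pi) &= \sum_{s} d_{\hat{\gamma}}(s)\, r_{\pi}(s).
\end{align}
```

Under this reading, $\hat{\gamma} = 0$ weights states by $d_\mu$, as in the excursion setting, while $\hat{\gamma} \to 1$ weights them by the stationary distribution the target policy would induce when deployed, which is consistent with the abstract's claim that the new objective better predicts deployed performance.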
DOI: 10.48550/arxiv.1903.11329