Policy Gradient with Active Importance Sampling
Format: Article
Language: English
Online access: Order full text
Abstract: Importance sampling (IS) is a fundamental technique underlying a large class of off-policy reinforcement learning approaches. Policy gradient (PG) methods, in particular, benefit significantly from IS, which enables the effective reuse of previously collected samples and thus increases sample efficiency. Classically, however, IS is employed in RL as a passive tool for re-weighting historical samples, whereas the statistical community uses it as an active tool: the behavioral distribution is chosen deliberately, which can drive the estimator variance even below that of the sample mean. In this paper, we focus on this second setting by addressing the behavioral policy optimization (BPO) problem: we look for the best behavioral policy from which to collect samples so as to reduce the policy gradient variance as much as possible. We provide an iterative algorithm that alternates between the cross-entropy estimation of the minimum-variance behavioral policy and the actual policy optimization, leveraging defensive IS. We theoretically analyze this algorithm, showing that it enjoys a convergence rate of order $O(\epsilon^{-4})$ to a stationary point, while depending on a more favorable variance term than standard PG methods. We then provide a practical version that is numerically validated, showing advantages in both the variance of the policy gradient estimates and the learning speed.
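To make the contrast between passive re-weighting and active (defensive) importance sampling concrete, the sketch below estimates a policy gradient for a one-dimensional Gaussian policy on a toy bandit by drawing actions from a defensive mixture of the target policy and a separate behavioral policy. This is only an illustration of the general defensive-IS idea mentioned in the abstract, not the paper's BPO/cross-entropy algorithm; all names and parameter values (e.g. `DEFENSIVE_ALPHA`, `SIGMA`, the toy reward) are assumptions made for this example.

```python
# Illustrative sketch (not the paper's algorithm): a defensive importance-sampled
# REINFORCE-style gradient estimate for a 1-D Gaussian policy on a toy bandit.
import numpy as np

rng = np.random.default_rng(0)

SIGMA = 1.0              # fixed policy standard deviation (assumed for this toy example)
DEFENSIVE_ALPHA = 0.3    # mass reserved for the target policy in the defensive mixture


def log_gaussian(x, mean, sigma=SIGMA):
    """Log-density of N(mean, sigma^2)."""
    return -0.5 * ((x - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))


def reward(action):
    """Toy reward: peaked at action = 2."""
    return np.exp(-0.5 * (action - 2.0) ** 2)


def defensive_is_gradient(target_mean, behavioral_mean, n_samples=1000):
    """Estimate d/d(mean) E_pi[reward] by sampling from the defensive mixture
    beta = alpha * pi_target + (1 - alpha) * q_behavioral, which keeps the
    importance weights pi/beta bounded by 1/alpha."""
    # Sample actions from the mixture.
    from_target = rng.random(n_samples) < DEFENSIVE_ALPHA
    means = np.where(from_target, target_mean, behavioral_mean)
    actions = rng.normal(means, SIGMA)

    # Importance weights pi(a) / beta(a).
    target_pdf = np.exp(log_gaussian(actions, target_mean))
    behavioral_pdf = np.exp(log_gaussian(actions, behavioral_mean))
    mixture_pdf = DEFENSIVE_ALPHA * target_pdf + (1 - DEFENSIVE_ALPHA) * behavioral_pdf
    weights = target_pdf / mixture_pdf

    # Score function of the Gaussian target policy w.r.t. its mean parameter.
    score = (actions - target_mean) / SIGMA ** 2

    grad_samples = weights * score * reward(actions)
    return grad_samples.mean(), grad_samples.var()


# Compare on-policy sampling (behavioral = target) with an off-target behavioral mean.
g_on, v_on = defensive_is_gradient(target_mean=0.0, behavioral_mean=0.0)
g_off, v_off = defensive_is_gradient(target_mean=0.0, behavioral_mean=2.0)
print(f"on-policy   : grad ~ {g_on:.3f}, per-sample variance ~ {v_on:.3f}")
print(f"defensive IS: grad ~ {g_off:.3f}, per-sample variance ~ {v_off:.3f}")
```

Because the target policy keeps at least `DEFENSIVE_ALPHA` of the mixture mass, every importance weight is bounded by `1 / DEFENSIVE_ALPHA`; choosing the behavioral component well is what the BPO problem described in the abstract is about.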
DOI: 10.48550/arxiv.2405.05630