Policy Gradient with Kernel Quadrature
Format: Article
Language: English
Abstract: Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, and to compute rewards only on that subset, for more efficient policy gradient iterations. We build a Gaussian process model of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an "episodic" kernel quadrature method to compress the information carried by the sampled episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as numerical illustrations on MuJoCo tasks.
DOI: 10.48550/arxiv.2310.14768
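
The abstract describes a three-step pipeline: derive a positive definite kernel on episodes from a Gaussian process model of returns, compress a large batch of episodes with a kernel quadrature rule, and run the policy gradient update on the resulting weighted subset. The sketch below only illustrates that idea and is not the authors' implementation: it assumes each episode is summarized by a fixed-length feature vector, uses a plain RBF kernel in place of the GP-derived episodic kernel, and substitutes kernel herding with uniform weights for the paper's kernel quadrature step; the names `episode_kernel` and `herding_select` are hypothetical.

```python
import numpy as np

def episode_kernel(feats, lengthscale=1.0):
    """RBF kernel between episode summaries; a stand-in for the GP-derived
    positive definite kernel on the space of episodes."""
    sq = np.sum(feats ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * lengthscale ** 2))

def herding_select(K, m):
    """Greedy kernel herding: pick m episodes whose empirical kernel mean
    approximates the kernel mean of the full batch."""
    mu = K.mean(axis=1)                      # kernel mean embedding at each episode
    selected = []
    for t in range(m):
        penalty = K[:, selected].sum(axis=1) / (t + 1) if selected else 0.0
        scores = mu - penalty
        scores[selected] = -np.inf           # forbid re-selecting an episode
        selected.append(int(np.argmax(scores)))
    return np.asarray(selected)

# Toy usage: 128 simulated episodes summarized by 8-dim features, compressed to 16.
rng = np.random.default_rng(0)
feats = rng.normal(size=(128, 8))            # hypothetical per-episode summaries
K = episode_kernel(feats)
idx = herding_select(K, m=16)
weights = np.full(idx.size, 1.0 / idx.size)  # herding uses uniform weights
# Rewards would then be evaluated only for the episodes in `idx`, and the policy
# gradient estimated as a weighted sum over that subset, e.g.
#   g_hat = sum_j weights[j] * R(episodes[idx[j]]) * grad log pi(episodes[idx[j]])
```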