Conservative State Value Estimation for Offline Reinforcement Learning
Format: Article
Language: English
Online access: Order full text
Abstract: Offline reinforcement learning faces the significant challenge of value over-estimation due to the distributional drift between the dataset and the currently learned policy, which leads to learning failure in practice. The common approach is to incorporate a penalty term into the reward or value estimate in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function by directly imposing a penalty on OOD states. Compared to prior work, CSVE enables more effective state value estimation with conservative guarantees and, in turn, better policy optimization. Building on CSVE, we develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states *around* the dataset, and the actor applies advantage-weighted updates extended with state exploration to improve the policy. We evaluate on the classic continuous control tasks of D4RL, showing that our method outperforms conservative Q-function learning methods and is strongly competitive with recent state-of-the-art methods.
DOI: 10.48550/arxiv.2302.06884
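
The abstract only outlines the idea of penalizing values on states sampled around the dataset. As a rough illustration of that kind of conservative state-value penalty, here is a minimal PyTorch-style sketch. It is not the paper's algorithm: the `ValueNet` class, the `conservative_v_loss` function, and the hyperparameters `beta` and `noise_std` are hypothetical, and the regularizer shown is a generic CQL-style term adapted to state values rather than CSVE as published.

```python
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Simple MLP state-value function V(s)."""

    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)


def conservative_v_loss(v_net, states, td_targets, beta=0.1, noise_std=0.1):
    """Illustrative conservative state-value loss (not the authors' code).

    - Regress V(s) toward TD targets on dataset states.
    - Sample nearby, possibly out-of-distribution states by perturbing
      dataset states, then push their values down while pushing values
      of in-dataset states up, so OOD states receive conservative estimates.
    `beta` and `noise_std` are assumed hyperparameters for this sketch.
    """
    v_data = v_net(states)
    bellman_loss = ((v_data - td_targets) ** 2).mean()

    # States sampled *around* the dataset via Gaussian perturbation.
    ood_states = states + noise_std * torch.randn_like(states)
    v_ood = v_net(ood_states)

    # Conservative regularizer: lower V on sampled OOD states,
    # raise V on dataset states.
    penalty = v_ood.mean() - v_data.mean()
    return bellman_loss + beta * penalty
```

In a full agent this critic loss would be paired with an advantage-weighted actor update, as the abstract describes; that part is omitted here.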