Off-Policy Primal-Dual Safe Reinforcement Learning
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: order full text
Abstract: Primal-dual safe RL methods commonly alternate between the primal update of
the policy and the dual update of the Lagrange multiplier. Such a training paradigm is
highly susceptible to error in cumulative cost estimation, since this estimate is the
key link connecting the primal and dual update processes. We show that this problem
causes significant underestimation of cost when using off-policy methods, leading to
failure to satisfy the safety constraint. To address this issue, we propose conservative
policy optimization, which learns a policy in a constraint-satisfying area by accounting
for the uncertainty in cost estimation. This improves constraint satisfaction but can
also hinder reward maximization. We then introduce local policy convexification to help
eliminate such suboptimality by gradually reducing the estimation uncertainty. We provide
theoretical interpretations of the joint coupling effect of these two ingredients and
further verify them with extensive experiments. Results on benchmark tasks show that our
method not only achieves asymptotic performance comparable to state-of-the-art on-policy
methods while using far fewer samples, but also significantly reduces constraint
violation during training. Our code is available at https://github.com/ZifanWu/CAL.
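The abstract refers to the standard primal-dual (Lagrangian) iteration for constrained RL. The minimal sketch below illustrates that generic scheme only, not the paper's implementation: the dual variable is updated from an *estimated* cumulative cost, so any estimation error (e.g. underestimation under off-policy training) propagates into both the multiplier and the policy objective. The critic stubs, `cost_limit`, and the step sizes are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's code): generic primal-dual
# Lagrangian iteration for safe RL. The dual update consumes an *estimated*
# cumulative cost, so estimation error directly drives both updates.

import numpy as np

cost_limit = 25.0    # constraint threshold d (assumed value)
lagrange = 0.0       # Lagrange multiplier, kept non-negative
lagrange_lr = 0.05   # dual step size (assumed)
policy_lr = 0.01     # primal step size (assumed)

def estimated_return(policy_params):
    """Stub standing in for a learned reward critic's estimate of J_r(pi)."""
    return -np.sum((policy_params - 1.0) ** 2)

def estimated_cost(policy_params):
    """Stub standing in for a learned cost critic's estimate of J_c(pi).
    Off-policy training can make this estimate biased (underestimated)."""
    return 10.0 * np.sum(policy_params ** 2)

policy_params = np.ones(4)

for it in range(200):
    # Primal step: ascend the Lagrangian J_r - lambda * J_c, here via a
    # finite-difference gradient as a stand-in for an actor update.
    lagrangian = lambda p: estimated_return(p) - lagrange * estimated_cost(p)
    grad = np.zeros_like(policy_params)
    eps = 1e-3
    for i in range(len(policy_params)):
        step = np.zeros_like(policy_params)
        step[i] = eps
        grad[i] = (lagrangian(policy_params + step)
                   - lagrangian(policy_params - step)) / (2 * eps)
    policy_params += policy_lr * grad

    # Dual step: increase lambda when the *estimated* cost exceeds the limit.
    # If the estimate underestimates the true cost, lambda stays too small
    # and the true constraint can be violated.
    cost_est = estimated_cost(policy_params)
    lagrange = max(0.0, lagrange + lagrange_lr * (cost_est - cost_limit))

print(f"lambda={lagrange:.3f}, estimated cost={estimated_cost(policy_params):.2f}")
```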
DOI: 10.48550/arxiv.2401.14758
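The abstract's "conservative policy optimization" constrains the policy using an uncertainty-aware cost estimate. One common way to realize that kind of conservatism, sketched below purely as an assumption rather than the paper's exact mechanism, is to train an ensemble of cost critics and constrain the policy with an upper confidence bound (mean plus a multiple of the ensemble standard deviation) instead of the mean estimate alone. The class name, network sizes, and the coefficient `k` are hypothetical.

```python
# Illustrative sketch (an assumption, not necessarily the CAL implementation):
# conservative cost estimation via an upper confidence bound over an ensemble
# of cost critics. Larger ensemble disagreement => larger estimated cost,
# which pushes the policy toward a constraint-satisfying region.

import torch
import torch.nn as nn

class CostCriticEnsemble(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, n_critics: int = 4):
        super().__init__()
        self.critics = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                nn.Linear(64, 1),
            )
            for _ in range(n_critics)
        ])

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        x = torch.cat([obs, act], dim=-1)
        # Stack per-critic predictions: shape (n_critics, batch, 1).
        return torch.stack([critic(x) for critic in self.critics], dim=0)

    def conservative_cost(self, obs, act, k: float = 1.0) -> torch.Tensor:
        """Upper-confidence-bound cost estimate: ensemble mean + k * std."""
        q_cost = self.forward(obs, act)
        return q_cost.mean(dim=0) + k * q_cost.std(dim=0)

# Usage example with random tensors standing in for a batch of transitions.
ensemble = CostCriticEnsemble(obs_dim=8, act_dim=2)
obs, act = torch.randn(32, 8), torch.randn(32, 2)
ucb_cost = ensemble.conservative_cost(obs, act, k=1.5)
print(ucb_cost.shape)  # torch.Size([32, 1])
```

In a primal-dual loop like the one sketched earlier, `conservative_cost` would replace the plain cost estimate in both the policy objective and the multiplier update, trading some reward for fewer constraint violations while the estimate is still uncertain.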