e-COP: Episodic Constrained Optimization of Policies
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: In this paper, we present the $\texttt{e-COP}$ algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system's behavior. We approach this problem by first establishing a policy difference lemma for the episodic setting, which provides the theoretical foundation for the algorithm. Then, we propose to combine a set of established and novel solution ideas to yield the $\texttt{e-COP}$ algorithm that is easy to implement and numerically stable, and provide a theoretical guarantee on optimality under certain scaling assumptions. Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. The scalability of the algorithm opens the door to its application in safety-constrained Reinforcement Learning from Human Feedback for Large Language or Diffusion Models.
DOI: 10.48550/arxiv.2406.09563
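
As an illustrative sketch (not taken from the paper), the episodic constrained policy optimization problem referred to in the abstract can be written, for an assumed horizon $H$, reward $r$, constraint cost $c$, and budget $d$, as
$$\max_{\pi} \; \mathbb{E}_{\pi}\!\Big[\sum_{h=1}^{H} r(s_h, a_h)\Big] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\!\Big[\sum_{h=1}^{H} c(s_h, a_h)\Big] \le d,$$
where the objective and the constraint are evaluated over the same finite-horizon episodes but use separate reward and cost signals.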