Conservative Exploration for Policy Optimization via Off-Policy Policy Evaluation
| | |
|---|---|
| Main Authors: | , , , , |
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online Access: | Order full text |
Abstract: A precondition for the deployment of a Reinforcement Learning agent to a real-world system is to provide guarantees on the learning process. While a learning algorithm will eventually converge to a good policy, there are no guarantees on the performance of the exploratory policies. We study the problem of conservative exploration, where the learner must guarantee that its performance is at least as good as that of a baseline policy. We propose the first conservative, provably efficient, model-free algorithm for policy optimization in continuous finite-horizon problems. We leverage importance sampling techniques to counterfactually evaluate the conservative condition from the data generated by the algorithm itself. We derive a regret bound and show that (w.h.p.) the conservative constraint is never violated during learning. Finally, we leverage these insights to build a general schema for conservative exploration in DeepRL via off-policy policy evaluation techniques. We empirically demonstrate the effectiveness of our methods.
DOI: 10.48550/arxiv.2312.15458
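
The abstract describes the key mechanism only at a high level: returns observed under the exploratory (behavior) policy are reweighted with importance sampling so that a candidate policy's value can be evaluated counterfactually and compared against the baseline before it is deployed. The following is a minimal illustrative Python sketch of that idea, not the paper's algorithm; the function names, the ordinary per-trajectory estimator, and the fixed `margin` parameter (a stand-in for the high-probability confidence width the paper derives) are assumptions for illustration.

```python
import numpy as np

def importance_sampling_estimate(trajectories, target_policy, behavior_policy, gamma=1.0):
    """Ordinary per-trajectory importance-sampling estimate of the target
    policy's expected return, using data collected under the behavior policy.
    Each trajectory is a list of (state, action, reward) tuples; the policies
    map (state, action) -> action probability."""
    weighted_returns = []
    for trajectory in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for state, action, reward in trajectory:
            # Cumulative importance weight: prod_t pi(a_t | s_t) / b(a_t | s_t)
            weight *= target_policy(state, action) / behavior_policy(state, action)
            ret += discount * reward
            discount *= gamma
        weighted_returns.append(weight * ret)
    return float(np.mean(weighted_returns))

def satisfies_conservative_condition(trajectories, candidate_policy, behavior_policy,
                                     baseline_value, margin=0.0):
    """Counterfactual check of the conservative condition: accept the candidate
    policy only if its estimated value is at least the baseline value, up to a
    safety margin (hypothetical placeholder for a confidence width)."""
    v_hat = importance_sampling_estimate(trajectories, candidate_policy, behavior_policy)
    return v_hat >= baseline_value - margin
```

A conservative algorithm in the spirit of the abstract would replace the point estimate minus a fixed margin with a high-probability lower confidence bound on the candidate's value, so that the conservative constraint is never violated w.h.p., as stated.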