Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
Main authors: ,
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
DOI: 10.48550/arxiv.2301.13734
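The abstract describes evaluating a target policy by running a separately designed behavior policy and reweighting the observed returns so the estimate stays unbiased, with the paper's contribution being a closed-form behavior policy (learned from offline data) that reduces the estimator's variance. As an illustration only, the sketch below shows a plain trajectory-wise importance-sampling Monte Carlo estimator of this kind; the environment interface, the `pi`/`mu` callables, and all names are assumptions made here, and the paper's closed-form behavior policy and its learning procedure are not reproduced.

```python
import numpy as np


def mc_estimate_with_behavior_policy(env, pi, mu, num_episodes, gamma=1.0, rng=None):
    """Ordinary importance-sampling Monte Carlo estimate of a target policy's
    expected return, using trajectories collected under a behavior policy.

    Assumed (hypothetical) interfaces:
      pi, mu: callables mapping a state to a 1-D array of action probabilities
              (pi is the target policy, mu the behavior policy).
      env:    episodic environment with reset() -> state and
              step(action) -> (next_state, reward, done).

    The estimator is unbiased for any mu whose support covers pi's support;
    only its variance depends on how mu is chosen.
    """
    rng = rng or np.random.default_rng()
    estimates = []
    for _ in range(num_episodes):
        state, done = env.reset(), False
        ratio, ret, discount = 1.0, 0.0, 1.0
        while not done:
            probs_mu = mu(state)
            action = rng.choice(len(probs_mu), p=probs_mu)
            # Trajectory-wise importance weight: product of per-step ratios pi/mu.
            ratio *= pi(state)[action] / probs_mu[action]
            state, reward, done = env.step(action)
            ret += discount * reward
            discount *= gamma
        estimates.append(ratio * ret)
    # Mean of the reweighted returns, plus a crude standard-error estimate;
    # the spread of these reweighted returns is what a better mu shrinks.
    return float(np.mean(estimates)), float(np.std(estimates) / np.sqrt(num_episodes))
```

Setting `mu = pi` recovers the standard on-policy Monte Carlo estimator; the data-efficiency gain studied in the paper comes from substituting a variance-reducing behavior policy learned from previously collected offline data.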