Offline Model-Based Adaptable Policy Learning for Decision-Making in Out-of-Support Regions

Bibliographic Details
Published in:	IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023-12, Vol. 45 (12), pp. 15260-15274
Main Authors:	Chen, Xiong-Hui; Luo, Fan-Ming; Yu, Yang; Li, Qingyang; Qin, Zhiwei; Shang, Wenjie; Ye, Jieping
Format:	Article
Language:	English
Subjects:	Adaptable policy learning; Adaptation models; Algorithms; Behavioral sciences; Constraints; Datasets; Decision making; Extrapolation; Locomotion; Meta learning; Model-based reinforcement learning; Offline reinforcement learning; Policies; Predictive models; Reinforcement learning; Trajectory; Uncertainty

Description
Summary:	In reinforcement learning, a promising direction to avoid online trial-and-error costs is learning from an offline dataset. Current offline reinforcement learning methods commonly learn in the policy space constrained to in-support regions by the offline dataset, in order to ensure the robustness of the outcome policies. Such constraints, however, also limit the potential of the outcome policies. In this paper, to release the potential of offline policy learning, we investigate the decision-making problems in out-of-support regions directly and propose offline Model-based Adaptable Policy LEarning (MAPLE). By this approach, instead of learning in in-support regions, we learn an adaptable policy that can adapt its behavior in out-of-support regions when deployed. We give a practical implementation of MAPLE via meta-learning techniques and ensemble model learning techniques. We conduct experiments on MuJoCo locomotion tasks with offline datasets. The results show that the proposed method can make robust decisions in out-of-support regions and achieve better performance than SOTA algorithms.
ISSN:	0162-8828 2160-9292 1939-3539
DOI:	10.1109/TPAMI.2023.3317131