Mild Policy Evaluation for Offline Actor-Critic

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2023-09, Vol. 35 (12), pp. 17950-17964
Authors: Huang, Longyang; Dong, Botao; Lu, Jinhui; Zhang, Weidong
Format: Article
Language: English
Description
Abstract: In offline actor-critic (AC) algorithms, the distributional shift between the training data and the target policy causes optimistic Q-value estimates for out-of-distribution (OOD) actions. This skews the learned policy toward OOD actions with falsely high Q-values. Existing value-regularized offline AC algorithms address this issue by learning a conservative value function, which leads to a performance drop. In this article, we propose mild policy evaluation (MPE), which constrains the difference between the Q-values of actions supported by the target policy and those of actions contained in the offline dataset. We analyze the convergence of the proposed MPE, the gap between the learned value function and the true one, and the suboptimality of offline AC with MPE. A mild offline AC (MOAC) algorithm is developed by integrating MPE into off-policy AC. Compared with existing offline AC algorithms, the value function gap of MOAC remains bounded even in the presence of sampling errors; moreover, in the absence of sampling errors, the true state value function can be obtained. Experimental results on the D4RL benchmark demonstrate the effectiveness of MPE and the performance superiority of MOAC over state-of-the-art offline reinforcement learning (RL) algorithms.
ISSN: 2162-237X, 2162-2388
DOI: 10.1109/TNNLS.2023.3309906
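
The abstract above describes the core mechanism of MPE: regularizing the critic so that the Q-values of actions proposed by the target policy do not drift above the Q-values of actions actually present in the offline dataset. The following is a minimal PyTorch sketch of one way such a critic update could look; the function and variable names (mpe_critic_loss, q_net, target_q_net, policy), the one-sided penalty, and the weighting coefficient alpha are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def mpe_critic_loss(q_net, target_q_net, policy, batch, gamma=0.99, alpha=1.0):
    # Sketch of a critic update with an MPE-style mild regularizer:
    # penalize only the gap between Q(s, a~pi) and Q(s, a_dataset),
    # rather than uniformly pushing down all policy-action values.
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # Standard Bellman backup target (no gradients through the target network).
    with torch.no_grad():
        a_next = policy(s_next)
        target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)

    td_loss = F.mse_loss(q_net(s, a), target)

    # Mild penalty: discourage Q-values of policy-proposed actions from
    # exceeding the Q-values of in-dataset actions (one-sided clamp is an assumption).
    a_pi = policy(s)
    gap = q_net(s, a_pi) - q_net(s, a)
    mild_penalty = gap.clamp(min=0.0).mean()

    return td_loss + alpha * mild_penalty

In this sketch the penalty vanishes whenever the policy's actions are valued no higher than the dataset actions, which reflects the "mild" (rather than uniformly conservative) flavor of the regularization described in the abstract.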