A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning
Abstract: Offline-to-online Reinforcement Learning (O2O RL) aims to improve the
performance of an offline pretrained policy using only a few online samples.
Built on offline RL algorithms, most O2O methods focus on the balance between
the RL objective and pessimism, or on the utilization of offline and online
samples. In this paper, from a novel perspective, we systematically study the
challenges that remain in O2O RL and identify that the slow performance
improvement and the instability of online finetuning stem from the inaccurate
Q-value estimation inherited from offline pretraining. Specifically, we
demonstrate that the estimation bias and the inaccurate rank of Q-values produce
a misleading signal for the policy update, making standard offline RL
algorithms, such as CQL and TD3-BC, ineffective in online finetuning. Based on
this observation, we address the problem of Q-value estimation with two
techniques: (1) perturbed value updates and (2) an increased frequency of
Q-value updates. The first technique smooths out biased Q-value estimates with
sharp peaks, preventing the policy from exploiting sub-optimal actions early in
finetuning. The second alleviates the estimation bias inherited from offline
pretraining by accelerating learning. Extensive experiments on the MuJoCo and
Adroit environments demonstrate that the proposed method, named SO2,
significantly alleviates Q-value estimation issues and consistently improves
performance over state-of-the-art methods by up to 83.1%.
DOI: 10.48550/arxiv.2312.07685
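The abstract names two mechanisms for repairing Q-value estimation during online finetuning: perturbing the value update and raising the frequency of Q-value updates. The PyTorch-style sketch below shows one way these could slot into a standard actor-critic finetuning loop; the function, its arguments (`sample_batch`, `noise_std`, `noise_clip`, `num_q_updates`, `tau`), and the default values are illustrative assumptions, not the authors' released SO2 implementation.

```python
# Illustrative sketch of the two fixes described in the abstract:
# (1) perturbed value updates: clipped Gaussian noise on the target action
#     smooths sharp peaks in the bootstrapped Q target;
# (2) increased update frequency: several critic updates per environment step.
import torch
import torch.nn.functional as F


def perturbed_q_updates(critic, critic_target, actor_target, sample_batch,
                        optimizer, gamma=0.99, noise_std=0.2, noise_clip=0.5,
                        num_q_updates=10, tau=0.005):
    """Run `num_q_updates` critic updates, each on a freshly sampled batch."""
    for _ in range(num_q_updates):
        obs, action, reward, next_obs, done = sample_batch()
        with torch.no_grad():
            # Perturbed value update: noise on the target action makes the
            # bootstrapped target average over a neighborhood of actions.
            next_action = actor_target(next_obs)
            noise = (torch.randn_like(next_action) * noise_std).clamp(
                -noise_clip, noise_clip)
            next_action = (next_action + noise).clamp(-1.0, 1.0)
            target_q = reward + gamma * (1.0 - done) * critic_target(
                next_obs, next_action)
        # Standard TD regression toward the smoothed target.
        td_loss = F.mse_loss(critic(obs, action), target_q)
        optimizer.zero_grad()
        td_loss.backward()
        optimizer.step()
        # Polyak-average the target critic after each update.
        with torch.no_grad():
            for p, p_t in zip(critic.parameters(), critic_target.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
    return td_loss.item()
```

Read this way, the noise term plays a role similar to TD3's target policy smoothing, while looping `num_q_updates` times per environment step raises the update-to-data ratio so the bias inherited from offline pretraining is corrected faster.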