SAMG: State-Action-Aware Offline-to-Online Reinforcement Learning with Offline Model Guidance
Format: Article
Language: English
Online access: Order full text
Abstract: The offline-to-online (O2O) paradigm in reinforcement learning (RL) uses models pre-trained on offline datasets for subsequent online fine-tuning. However, conventional O2O RL algorithms typically require maintaining large offline datasets and retraining on them to mitigate the effects of out-of-distribution (OOD) data, which limits their efficiency in exploiting online samples. To address this challenge, we introduce a new paradigm called SAMG: State-Action-Conditional Offline-to-Online Reinforcement Learning with Offline Model Guidance. Rather than training directly on offline data, SAMG freezes the pre-trained offline critic and uses it to provide an offline value for each state-action pair, delivering compact offline information. Because these values come from the frozen offline model, the framework eliminates the need to retrain on offline data. The offline values are then combined with the online target critic through a Bellman equation weighted by a policy state-action-aware coefficient. This coefficient, derived from a conditional variational auto-encoder (C-VAE), captures the reliability of the offline data at the state-action level. SAMG can be easily integrated with existing Q-function-based O2O RL algorithms. Theoretical analysis shows that SAMG achieves good optimality and a lower estimation error, and empirical evaluations demonstrate that it outperforms four state-of-the-art O2O RL algorithms on the D4RL benchmark.
DOI: 10.48550/arxiv.2410.18626
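
To make the mechanism in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) of how a frozen offline critic might be blended with an online target critic through a coefficient derived from a C-VAE. Everything specific here is an assumption for illustration: the coefficient is taken to be exp(-reconstruction error / temperature), the Bellman target is assumed to be a convex combination of the two critics, and the names CVAE, state_action_coefficient, samg_style_target, q_offline_frozen, and q_online_target are made up; the paper's actual coefficient and weighting scheme may differ.

```python
# Hypothetical sketch of a SAMG-style weighted Bellman target (not the authors' code).
# Assumptions: the target blends a frozen offline critic with the online target critic
# via a per-(s, a) coefficient w(s, a) in (0, 1]; w is derived from the reconstruction
# error of a conditional VAE fit to the offline dataset, so in-distribution pairs
# (low error) lean on the offline value and OOD pairs lean on the online value.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CVAE(nn.Module):
    """Conditional VAE that reconstructs actions conditioned on states."""

    def __init__(self, state_dim, action_dim, latent_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),          # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, action):
        mu, log_var = self.encoder(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
        recon = self.decoder(torch.cat([state, z], dim=-1))
        return recon, mu, log_var


@torch.no_grad()
def state_action_coefficient(cvae, state, action, temperature=1.0):
    """Map C-VAE reconstruction error to a weight in (0, 1].

    Low reconstruction error -> the pair looks in-distribution -> weight near 1
    (trust the frozen offline critic); high error -> weight near 0.
    """
    recon, _, _ = cvae(state, action)
    err = F.mse_loss(recon, action, reduction="none").mean(dim=-1, keepdim=True)
    return torch.exp(-err / temperature)


@torch.no_grad()
def samg_style_target(reward, not_done, next_state, next_action,
                      q_offline_frozen, q_online_target, cvae, gamma=0.99):
    """Weighted Bellman target mixing the frozen offline and online target critics."""
    w = state_action_coefficient(cvae, next_state, next_action)
    q_off = q_offline_frozen(next_state, next_action)   # frozen, never retrained
    q_on = q_online_target(next_state, next_action)
    next_q = w * q_off + (1.0 - w) * q_on
    return reward + gamma * not_done * next_q
```

In a Q-function-based O2O algorithm, the online critic would then be regressed toward samg_style_target in place of the usual bootstrapped target during fine-tuning, with the offline critic kept frozen and the C-VAE presumably fit to the offline dataset beforehand; how the paper actually trains and schedules these components is not specified in the abstract above.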