Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning
Format: Article
Language: English
Online access: Order full text
Abstract: Offline reinforcement learning (RL) aims to learn optimal policies from
previously collected datasets. Recently, due to their powerful representational
capabilities, diffusion models have shown significant potential as policy
models for offline RL problems. However, previous offline RL algorithms based on
diffusion policies generally adopt weighted regression to improve the policy.
This approach optimizes the policy only using the collected actions and is
sensitive to Q-values, which limits the potential for further performance
enhancement. To this end, we propose a novel preferred-action-optimized
diffusion policy for offline RL. In particular, an expressive conditional
diffusion model is utilized to represent the diverse distribution of a behavior
policy. Meanwhile, based on the diffusion model, preferred actions within the
same behavior distribution are automatically generated through the critic
function. Moreover, an anti-noise preference optimization is designed to
achieve policy improvement by using the preferred actions, which can adapt to
noise-preferred actions for stable training. Extensive experiments demonstrate
that the proposed method provides competitive or superior performance compared
to previous state-of-the-art offline RL methods, particularly in sparse reward
tasks such as Kitchen and AntMaze. Additionally, we empirically prove the
effectiveness of anti-noise preference optimization.
DOI: 10.48550/arxiv.2405.18729
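To make the sampling-then-selection idea described in the abstract concrete, the following is a minimal sketch (not the authors' released code) of how preferred actions could be generated: draw several candidate actions per state from the conditional diffusion behavior model and keep the one the critic scores highest. The names `diffusion_policy.sample`, `critic`, and `generate_preferred_actions` are hypothetical placeholders, not interfaces from the paper.

```python
import torch

@torch.no_grad()
def generate_preferred_actions(diffusion_policy, critic, states, num_candidates=16):
    """For each state, sample candidate actions from the diffusion behavior
    model and keep the one the learned critic Q(s, a) values most."""
    batch_size = states.shape[0]
    # Repeat each state so that num_candidates actions are sampled per state.
    repeated = states.repeat_interleave(num_candidates, dim=0)   # (B*K, state_dim)
    candidates = diffusion_policy.sample(repeated)               # (B*K, action_dim)
    # Score every candidate with the critic and pick the best per state.
    q_values = critic(repeated, candidates).view(batch_size, num_candidates)
    best = q_values.argmax(dim=1)                                # (B,)
    candidates = candidates.view(batch_size, num_candidates, -1)
    preferred = candidates[torch.arange(batch_size), best]       # (B, action_dim)
    return preferred
```

Under this reading, the selected samples would then serve as the preferred targets in the paper's anti-noise preference optimization step; the loss used there is not specified in the abstract, so it is not sketched here.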