Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online Access: | Order full text |
Summary: | Reinforcement learning (RL) training is inherently unstable due to
factors such as moving targets and high gradient variance. Reinforcement
Learning from Human Feedback (RLHF) and Reinforcement Learning from AI
Feedback (RLAIF) can introduce additional difficulty: differing preferences
can complicate the alignment process, and prediction errors in a trained
reward model can become more severe as the LLM generates unseen outputs. To
enhance training robustness, RL has adopted techniques from supervised
learning, such as ensembles and layer normalization. In this work, we improve
the stability of RL training by adapting the reverse cross entropy (RCE)
loss, used in supervised learning on noisy data, to define a symmetric RL
loss. We demonstrate performance improvements across various tasks and model
scales. We conduct experiments on discrete-action tasks (Atari games) and
continuous-action tasks (the MuJoCo benchmark and Box2D) using Symmetric A2C
(SA2C) and Symmetric PPO (SPPO), with and without added noise, with
especially notable performance gains for SPPO across different
hyperparameters. Furthermore, we validate the benefits of the symmetric RL
loss when using SPPO for large language models through improved performance
on RLHF tasks such as IMDB positive sentiment and TL;DR summarization. |
---|---|
DOI: | 10.48550/arxiv.2405.17618 |
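
For context on the loss named in the abstract, below is a minimal sketch of the standard symmetric cross-entropy construction from the noisy-label supervised-learning literature, from which the reverse cross entropy (RCE) term comes, together with one plausible way to attach it to an advantage-weighted policy-gradient objective. The weights α and β, the log 0 ↦ A truncation, and the RL instantiation are assumptions drawn from that literature, not details taken from this record, and they are not necessarily the paper's exact formulation.

```latex
% Requires amsmath/amssymb if compiled as part of a document.
% Cross entropy (CE) and reverse cross entropy (RCE) between a model
% prediction p(k|x) and a (possibly noisy) target distribution q(k|x).
% RCE simply swaps the roles of p and q; since q is often one-hot,
% log 0 is truncated to a constant A < 0 (e.g. A = -4) so the term is finite.
\[
\ell_{\mathrm{CE}}  = -\sum_{k=1}^{K} q(k \mid x)\,\log p(k \mid x),
\qquad
\ell_{\mathrm{RCE}} = -\sum_{k=1}^{K} p(k \mid x)\,\log q(k \mid x).
\]
% The symmetric loss is a weighted sum of the two terms
% (alpha, beta > 0 are hyperparameters from the noisy-label literature):
\[
\ell_{\mathrm{sym}} = \alpha\,\ell_{\mathrm{CE}} + \beta\,\ell_{\mathrm{RCE}}.
\]
% One plausible RL instantiation (an assumption, not necessarily the exact
% formulation in the paper): treat the sampled action a_t as a one-hot target
% for the policy pi_theta(. | s_t), weight both terms by the advantage
% estimate \hat{A}_t, and note that with a one-hot target the RCE term
% reduces to -A (1 - pi_theta(a_t | s_t)):
\[
\mathcal{L}_{\mathrm{sym}}(\theta)
  = \mathbb{E}_t\!\Bigl[\hat{A}_t \Bigl(
      -\alpha \log \pi_\theta(a_t \mid s_t)
      \;-\; \beta A \bigl(1 - \pi_\theta(a_t \mid s_t)\bigr)
    \Bigr)\Bigr].
\]
```

In the noisy-label setting the RCE term tempers over-confident fitting to corrupted targets; the same intuition would apply to noisy rewards or reward-model errors in RLHF, which is presumably why the abstract pairs the symmetric loss with A2C and PPO as SA2C and SPPO.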