Investigating Regularization of Self-Play Language Models
Format: Article
Language: English
Abstract: This paper explores the effects of various forms of regularization in the context of language model alignment via self-play. While both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) require the collection of costly human-annotated pairwise preferences, the self-play fine-tuning (SPIN) approach replaces the rejected answers with data generated by the previous iterate. However, the SPIN method presents a performance instability issue during the learning phase, which can be mitigated by playing against a mixture of the two previous iterates. In the same vein, we propose in this work to address this issue from two perspectives: first, by incorporating an additional Kullback-Leibler (KL) regularization to stay in the proximity of the reference policy; second, by using the idea of fictitious play, which smooths the opponent policy across all previous iterations. In particular, we show that the KL-based regularizer boils down to replacing the previous policy with its geometric mixture with the base policy inside the SPIN loss function. We finally discuss empirical results on MT-Bench as well as on the Hugging Face Open LLM Leaderboard.
DOI: 10.48550/arxiv.2404.04291
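
The following is a minimal LaTeX sketch of the objectives the abstract describes, written with assumed notation: the opponent policy \mu_t, the scale \lambda, the mixing weight \alpha, the reference policy \pi_{\mathrm{ref}}, and the uniform fictitious-play average are illustrative choices, not taken from the paper itself.

```latex
% Schematic SPIN-style objective at iteration t+1: (x, y_w) is a prompt with a
% human-written answer, y_l is sampled from an opponent policy \mu_t, and \ell
% is a decreasing (e.g. logistic) loss applied to the scaled log-ratio gap.
\[
  \mathcal{L}_{t+1}(\theta)
  = \mathbb{E}_{(x,\,y_w)\sim\mathcal{D},\; y_l\sim\mu_t(\cdot\mid x)}
    \left[ \ell\!\left(
      \lambda \log\frac{\pi_\theta(y_w\mid x)}{\mu_t(y_w\mid x)}
      - \lambda \log\frac{\pi_\theta(y_l\mid x)}{\mu_t(y_l\mid x)}
    \right) \right]
\]

% Plain SPIN: the opponent is simply the previous iterate.
\[
  \mu_t = \pi_{\theta_t}
\]

% KL-regularized variant described in the abstract: the opponent is replaced by
% a geometric mixture of the previous iterate and the base (reference) policy,
% with some mixing weight \alpha \in [0,1]; the normalization is omitted.
\[
  \mu_t(y\mid x) \;\propto\;
  \pi_{\theta_t}(y\mid x)^{\,1-\alpha}\,\pi_{\mathrm{ref}}(y\mid x)^{\,\alpha}
\]

% Fictitious-play variant: the opponent is smoothed across all previous
% iterates, shown here as a uniform average (the actual weighting may differ).
\[
  \mu_t = \frac{1}{t}\sum_{k=0}^{t-1}\pi_{\theta_k}
\]
```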