BOND: Aligning LLMs with Best-of-N Distillation
Format: Article
Language: English
Online access: Order full text
Abstract: Reinforcement learning from human feedback (RLHF) is a key driver of quality
and safety in state-of-the-art large language models. Yet, a surprisingly
simple and strong inference-time strategy is Best-of-N sampling, which selects
the best generation among N candidates. In this paper, we propose Best-of-N
Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but
without its significant computational overhead at inference time. Specifically,
BOND is a distribution matching algorithm that forces the distribution of
generations from the policy to get closer to the Best-of-N distribution. We use
the Jeffreys divergence (a linear combination of forward and backward KL) to
balance between mode-covering and mode-seeking behavior, and derive an
iterative formulation that utilizes a moving anchor for efficiency. We
demonstrate the effectiveness of our approach and several design choices
through experiments on abstractive summarization and Gemma models. Aligning
Gemma policies with BOND outperforms other RLHF algorithms by improving results
on several benchmarks.
DOI: 10.48550/arxiv.2407.14622
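
The abstract's central objects, the Best-of-N distribution and the Jeffreys divergence used to distill toward it, can be made concrete on a toy example. The sketch below is not the authors' implementation; it assumes a finite outcome space with distinct rewards, and the names `policy`, `rewards`, and `beta` are illustrative. It computes the exact Best-of-N distribution (the law of the highest-reward outcome among N i.i.d. draws from the policy) and a beta-weighted combination of forward and backward KL between the policy and that distribution.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): exact
# Best-of-N distribution over a finite outcome space and a Jeffreys-style
# divergence, as described in the abstract.
import numpy as np

def best_of_n_distribution(policy: np.ndarray, rewards: np.ndarray, n: int) -> np.ndarray:
    """Distribution of the highest-reward outcome among n i.i.d. draws.

    Assumes `rewards` are distinct, so the best candidate is unambiguous.
    Then P_BoN(y) = F(r(y))^n - F_<(r(y))^n, where F is the reward CDF
    under `policy` and F_< its strict version.
    """
    order = np.argsort(rewards)            # sort outcomes by reward
    p_sorted = policy[order]
    cdf = np.cumsum(p_sorted)              # P(reward <= r(y))
    cdf_below = cdf - p_sorted             # P(reward <  r(y))
    bon_sorted = cdf**n - cdf_below**n
    bon = np.empty_like(bon_sorted)
    bon[order] = bon_sorted                # undo the sort
    return bon

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) for strictly positive categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p: np.ndarray, q: np.ndarray, beta: float) -> float:
    """Beta-weighted combination of the two KL directions (the abstract's
    'linear combination of forward and backward KL'; the paper's exact
    weighting convention may differ)."""
    return beta * kl(p, q) + (1.0 - beta) * kl(q, p)

if __name__ == "__main__":
    policy = np.full(5, 0.2)               # uniform toy policy over 5 outcomes
    rewards = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    bon = best_of_n_distribution(policy, rewards, n=4)
    print("Best-of-4 distribution:", np.round(bon, 4))  # mass shifts to high-reward outcomes
    print("Jeffreys(policy, BoN):", round(jeffreys(policy, bon, beta=0.5), 4))
```

Per the abstract, BOND trains the policy to reduce such a divergence toward the Best-of-N distribution, using an iterative scheme with a moving anchor; that training loop is not reproduced in this sketch.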