ROPO: Robust Preference Optimization for Large Language Models
Main Authors: , , , , , , , ,
Format: Article
Language: English
Online Access: Order full text
Abstract: Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data. Recent efforts on this problem either only marginally alleviate the impact of noise, without actually reducing its presence, or rely on costly teacher LLMs that are prone to reward misgeneralization. To address these challenges, we propose the RObust Preference Optimization (ROPO) framework, an iterative alignment approach that integrates noise tolerance and the filtering of noisy samples without the aid of external models. Specifically, ROPO iteratively solves a constrained optimization problem in which we dynamically assign a quality-aware weight to each sample and constrain the sum of the weights to the number of samples we intend to retain. For noise-tolerant training and effective noise identification, we derive a robust loss by suppressing the gradients of samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is critical for distinguishing noisy samples from clean ones. Furthermore, inspired by the derived loss, we propose a robustness-guided rejection sampling technique to recover the potentially important information in discarded queries. Experiments on three widely used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods, and that its advantage grows as the noise rate increases.
DOI: 10.48550/arxiv.2404.04102
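
The quality-aware weighting step described in the abstract admits a simple illustration. Assuming the per-iteration subproblem takes the common form of minimizing the weighted training loss subject to weights in [0, 1] that sum to a retention budget, its minimizer keeps the lowest-loss samples; the sketch below (PyTorch, with hypothetical names) shows that solution, not ROPO's actual implementation.

```python
import torch

def quality_aware_weights(sample_losses: torch.Tensor, num_keep: int) -> torch.Tensor:
    """Sketch of the weighting subproblem from the abstract, under the
    assumption it has the form: minimize sum_i w_i * loss_i subject to
    0 <= w_i <= 1 and sum_i w_i = num_keep. The minimizer assigns
    weight 1 to the num_keep lowest-loss samples and 0 to the rest,
    which makes the step an implicit loss-driven filter."""
    order = torch.argsort(sample_losses)   # ascending: likely-clean samples first
    weights = torch.zeros_like(sample_losses)
    weights[order[:num_keep]] = 1.0        # retain the num_keep cleanest samples
    return weights
```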
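The robust loss can likewise be sketched. The version below assumes a DPO-style preference margin and replaces the unbounded -log sigmoid(margin) with a bounded sigmoid, which caps per-sample gradients; it is an illustrative noise-tolerant loss in the same spirit, not necessarily the exact loss derived in the paper.

```python
import torch

def bounded_preference_loss(policy_logratio: torch.Tensor,
                            ref_logratio: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """Illustrative noise-tolerant preference loss (assumed form, not
    ROPO's exact derivation). Both inputs are
    log p(chosen|x) - log p(rejected|x), under the trained policy and a
    frozen reference model respectively, as in DPO-style alignment.
    Replacing DPO's -log(sigmoid(margin)) with the bounded
    sigmoid(-margin) makes the per-sample gradient weight
    sigmoid(margin) * sigmoid(-margin), which shrinks when the model
    assigns the labeled preference low probability, so likely-noisy
    pairs cannot dominate an update and remain separable from clean
    pairs by their loss values."""
    margin = beta * (policy_logratio - ref_logratio)
    return torch.sigmoid(-margin).mean()
```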
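Finally, the robustness-guided rejection sampling step: for a query whose original pair was filtered out, new responses can be sampled and re-paired so the query's information is not lost. `generate_fn` and `score_fn` below are hypothetical stand-ins, since the abstract does not specify these interfaces.

```python
from typing import Callable, List, Tuple

def rebuild_preference_pair(query: str,
                            generate_fn: Callable[[str], str],
                            score_fn: Callable[[str, str], float],
                            num_candidates: int = 8) -> Tuple[str, str]:
    """Hypothetical rejection-sampling step: draw several candidate
    responses for a discarded query, rank them with a robustness-derived
    score, and form a fresh (chosen, rejected) pair from the extremes."""
    candidates: List[str] = [generate_fn(query) for _ in range(num_candidates)]
    ranked = sorted(candidates, key=lambda response: score_fn(query, response))
    return ranked[-1], ranked[0]   # (chosen, rejected)
```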