Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one respons...

Bibliographic Details
Published in:	arXiv.org 2024-07
Main authors:	Pal, Arka; Karkhanis, Deep; Dooley, Samuel; Roberts, Manley; Naidu, Siddartha; White, Colin
Format:	Article
Language:	English
Subjects:	Datasets; Failure modes; Large language models; Optimization
Online access:	Full text