A General Theoretical Paradigm to Understand Learning from Human Preferences
Saved in:

Main Authors:
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Summary: The prevalent deployment of learning from human preferences through reinforcement learning (RLHF) relies on two important approximations: the first assumes that pairwise preferences can be substituted with pointwise rewards; the second assumes that a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy. Recently, Direct Preference Optimisation (DPO) has been proposed as an approach that bypasses the second approximation and learns a policy directly from collected data, without the reward-modelling stage. However, this method still relies heavily on the first approximation. In this paper we try to gain a deeper theoretical understanding of these practical algorithms. In particular, we derive a new general objective for learning from human preferences, called $\Psi$PO, that is expressed in terms of pairwise preferences and therefore bypasses both approximations. This new general objective allows us to perform an in-depth analysis of the behavior of RLHF and DPO (as special cases of $\Psi$PO) and to identify their potential pitfalls. We then consider another special case of $\Psi$PO, obtained by simply setting $\Psi$ to the identity, for which we can derive an efficient optimisation procedure, prove performance guarantees, and demonstrate its empirical superiority to DPO on some illustrative examples.
DOI: 10.48550/arxiv.2310.12036
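The summary above does not spell out the objectives it refers to, so the following is a minimal sketch of the form such objectives are usually written in, not a transcription from the paper. The symbols $\rho$ (context distribution), $\mu$ (completion-sampling policy), $\pi_{\mathrm{ref}}$ (reference policy), $\tau > 0$ (regularisation strength) and $p^*(y \succ y' \mid x)$ (true pairwise preference probability) are assumptions introduced here purely for illustration. The general $\Psi$PO-style objective maximises a transformed pairwise preference probability under a KL penalty towards the reference policy:

$$
\max_{\pi}\;\; \mathbb{E}_{x \sim \rho}\,\mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \mu(\cdot \mid x)}\Big[\Psi\big(p^*(y \succ y' \mid x)\big)\Big] \;-\; \tau\, \mathrm{D}_{\mathrm{KL}}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
$$

With $\Psi$ taken to be the identity, one way such an objective can be optimised directly from pairwise data, with no reward model, is through a squared-error loss on observed preference pairs, where $y_w$ denotes the preferred and $y_l$ the rejected completion in a pair drawn from a dataset $\mathcal{D}$:

$$
\min_{\pi}\;\; \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\left(\log \frac{\pi(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}{\pi(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \frac{1}{2\tau}\right)^{2}\right]
$$

The squared-error form regresses the log-likelihood-ratio margin of preferred over rejected completions towards a fixed target set by the regularisation strength, which is what allows learning from pairwise preferences without the pointwise-reward substitution the summary describes; the exact definitions and guarantees are given in the paper itself.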