Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
Format: Article
Language: English
Online Access: Order full text
Abstract: Learning a reward model (RM) from human preferences has been an important component in aligning large language models (LLMs). The canonical setup of learning RMs from pairwise preference data is rooted in the classic Bradley-Terry (BT) model, which accepts only binary feedback: the label is either that Response 1 is better than Response 2, or the opposite. Such a setup inevitably discards potentially useful samples (such as those labeled "tied" between the two responses) and loses finer-grained information (such as "slightly better"). In this paper, we propose a framework for learning RMs under ordinal feedback, which generalizes binary preference feedback to arbitrary granularity. Specifically, we first identify a marginal unbiasedness condition, which generalizes the assumption of the BT model in the existing binary feedback setting. The condition is justified by the sociological concept of the wisdom of the crowd. Under this condition, we develop a natural probability model for pairwise preference data under ordinal feedback and analyze its properties. We prove the statistical benefits of ordinal feedback in terms of reducing the Rademacher complexity relative to binary feedback. The proposed learning objective and theory also extend to hinge loss and direct preference optimization (DPO). In particular, the theoretical analysis may be of independent interest when applied to the seemingly unrelated problem of knowledge distillation, where it interprets the bias-variance trade-off therein. The framework also sheds light on how to write annotation guidelines for human annotators. Our numerical experiments validate that fine-grained feedback leads to better reward learning in both in-distribution and out-of-distribution settings. Further experiments show that incorporating a certain proportion of samples with tied preferences boosts RM learning.
DOI: 10.48550/arxiv.2411.12843
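To make the ordinal-feedback idea concrete, the following is a minimal sketch, assuming each ordinal label is mapped to a soft target z in [0, 1] and used in a soft-label cross-entropy on the reward margin; the mapping values and the names ORDINAL_TO_SOFT and ordinal_pairwise_loss are illustrative assumptions, not definitions taken from the paper. With z restricted to {0, 1}, the loss reduces to the standard Bradley-Terry pairwise objective.

```python
# Illustrative sketch (not the paper's exact objective): a pairwise reward loss
# that accepts ordinal feedback by mapping each label to a soft target z in [0, 1]
# and minimizing -[z * log sigmoid(m) + (1 - z) * log sigmoid(-m)] on the margin m.
import torch
import torch.nn.functional as F

# Hypothetical mapping from ordinal labels to soft targets; the exact values
# would come from the annotation guideline, not from the abstract above.
ORDINAL_TO_SOFT = {
    "r1_much_better": 1.0,
    "r1_slightly_better": 0.75,
    "tie": 0.5,
    "r2_slightly_better": 0.25,
    "r2_much_better": 0.0,
}

def ordinal_pairwise_loss(reward_1: torch.Tensor,
                          reward_2: torch.Tensor,
                          soft_label: torch.Tensor) -> torch.Tensor:
    """Soft-label cross-entropy on the reward margin.

    reward_1, reward_2: scalar rewards for the two responses, shape (batch,).
    soft_label: target probability that Response 1 is preferred, shape (batch,).
    """
    margin = reward_1 - reward_2
    # Soft targets are allowed here; with targets in {0, 1} this is the BT loss.
    return F.binary_cross_entropy_with_logits(margin, soft_label)

# Toy usage with hypothetical reward scores and ordinal labels.
r1 = torch.tensor([2.1, 0.3, -0.5])
r2 = torch.tensor([1.0, 0.4, -0.5])
labels = ["r1_slightly_better", "tie", "tie"]
z = torch.tensor([ORDINAL_TO_SOFT[name] for name in labels])
print(ordinal_pairwise_loss(r1, r2, z))
```

The point of the sketch is that a tie contributes a symmetric target of 0.5 and "slightly better" contributes an intermediate target, so these samples enter the objective rather than being discarded as they would be under the purely binary setup.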