Inverse Constitutional AI: Compressing Preferences into Principles
Format: Article
Language: eng
Online access: Order full text
Abstract: Feedback data plays an important role in fine-tuning and evaluating state-of-the-art AI models. Often pairwise text preferences are used: given two texts, human (or AI) annotators select the "better" one. Such feedback data is widely used to align models to human preferences (e.g., reinforcement learning from human feedback) or to rank models according to human preferences (e.g., Chatbot Arena). Despite its widespread use, prior work has demonstrated that human-annotated pairwise text preference data often exhibits unintended biases. For example, human annotators have been shown to prefer assertive over truthful texts in certain contexts. Models trained or evaluated on this data may implicitly encode these biases in a manner that is hard to identify. In this paper, we formulate the interpretation of existing pairwise text preference data as a compression task: the Inverse Constitutional AI (ICAI) problem. In constitutional AI, a set of principles (or constitution) is used to provide feedback and fine-tune AI models. The ICAI problem inverts this process: given a dataset of feedback, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding initial ICAI algorithm and validate its generated constitutions quantitatively based on reconstructed annotations. Generated constitutions have many potential use cases: they may help identify undesirable biases, scale feedback to unseen data, or assist with adapting LLMs to individual user preferences. We demonstrate our approach on a variety of datasets: (a) synthetic feedback datasets with known underlying principles; (b) the AlpacaEval dataset of cross-annotated human feedback; and (c) the crowdsourced Chatbot Arena dataset. We release the code for our algorithm and experiments at https://github.com/rdnfn/icai.
DOI: 10.48550/arxiv.2406.06560
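The abstract frames ICAI as a compression task: among candidate constitutions, select the one that best enables an LLM annotator to reconstruct the original preference labels. The following Python sketch illustrates only that reconstruction objective; it is not the authors' algorithm (see the linked repository for that). The `toy_annotate` function, the sample pairs, and the candidate principles are hypothetical stand-ins for an actual LLM-based annotator and real feedback data.

```python
# Minimal sketch of the ICAI reconstruction objective (illustrative only;
# see https://github.com/rdnfn/icai for the actual algorithm).
# A "constitution" is scored by how well an annotator guided by it
# reproduces the original pairwise preference labels.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    text_a: str
    text_b: str
    label: str  # "a" or "b": which text the original annotator preferred


def reconstruction_accuracy(
    constitution: List[str],
    data: List[PreferencePair],
    annotate: Callable[[List[str], str, str], str],
) -> float:
    """Fraction of original annotations reproduced under the given constitution."""
    hits = sum(
        1 for pair in data
        if annotate(constitution, pair.text_a, pair.text_b) == pair.label
    )
    return hits / len(data)


def toy_annotate(constitution: List[str], text_a: str, text_b: str) -> str:
    """Placeholder annotator (stands in for an LLM prompted with the constitution).

    Only a hypothetical "prefer concise responses" principle is checked here;
    a real annotator would ask an LLM to judge the pair against all principles.
    """
    if any("concise" in principle.lower() for principle in constitution):
        return "a" if len(text_a) <= len(text_b) else "b"
    return "a"  # uninformed default


if __name__ == "__main__":
    # Tiny synthetic dataset whose hidden principle is "shorter is better".
    data = [
        PreferencePair("Yes, absolutely, without any doubt whatsoever.", "Yes.", "b"),
        PreferencePair("Use a dict for O(1) lookups.",
                       "There are many ways one could think about this question.", "a"),
    ]
    candidates = [
        ["Prefer concise responses."],
        ["Prefer assertive responses."],
    ]
    # Select the candidate constitution that best reconstructs the annotations.
    best = max(candidates, key=lambda c: reconstruction_accuracy(c, data, toy_annotate))
    print("Best candidate constitution:", best)
```

In this toy setup the "concise" constitution reconstructs both labels while the alternative reconstructs only one, so the former is selected; the paper's algorithm additionally generates the candidate principles from the feedback data itself rather than assuming them in advance.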