Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
Saved in:
Main authors: |  |
---|---|
Format: | Article |
Language: | eng |
Subjects: |  |
Online access: | Order full text |
Abstract: | When prompting a language model (LM), users often expect the model to adhere
to a set of behavioral principles across diverse tasks, such as producing
insightful content while avoiding harmful or biased language. Instilling such
principles (i.e., a constitution) into a model is resource-intensive,
technically challenging, and generally requires human preference labels or
examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained
language model (without requiring preference labels or demonstrations) to
increase the conditional mutual information between constitutions and
self-generated responses given queries from a dataset. On single-turn dialogue
and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained
model, with win rates between 66% and 77%. Strikingly, it also surpasses an
instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55%
and 57% on single-turn dialogue. SAMI requires a model that writes the
principles; to avoid depending on strong models for this step, we
align a strong pretrained model (mixtral-8x7b) using constitutions written by a
weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win
rate on summarization. Finally, we investigate whether SAMI generalizes to
diverse summarization principles (e.g., "summaries should be scientific") and
scales to stronger models (llama3-70b), finding that it achieves win rates of
up to 68% for learned and 67% for held-out principles compared to the base
model. Our results show that a pretrained LM can learn to follow constitutions
without using preference labels, demonstrations, or human oversight. |
---|---|
DOI: | 10.48550/arxiv.2404.14313 |
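The abstract above characterizes SAMI as finetuning a pretrained LM to increase the conditional mutual information between constitutions and self-generated responses given queries. As a rough illustration of what such an objective can look like (a sketch assuming a standard InfoNCE-style contrastive bound, not a claim about the paper's exact loss), maximizing the quantity below pushes each self-generated response to be more likely under the constitution it was generated with than under the other constitutions in a batch:

```latex
% Illustrative only: a generic InfoNCE-style lower bound on the conditional
% mutual information I(c; y | x) between constitutions c and responses y given
% queries x, estimated over a batch of N constitution--response pairs. The
% exact form is an assumption for exposition, not SAMI's implementation.
\[
  I(c;\, y \mid x)
  \;\ge\;
  \mathbb{E}\!\left[
    \frac{1}{N} \sum_{i=1}^{N}
    \log \frac{N \, p_\theta(y_i \mid c_i, x_i)}
              {\sum_{j=1}^{N} p_\theta(y_i \mid c_j, x_i)}
  \right]
\]
% Here y_i is sampled from the model itself conditioned on (c_i, x_i), and
% p_\theta(y_i \mid c_j, x_i) is the model's own likelihood of response y_i
% under a (possibly mismatched) constitution c_j.
```

Intuitively, the diagonal terms p_θ(y_i | c_i, x_i) reward responses that are distinctive for their own constitution relative to the others in the batch, which is consistent with the abstract's claim that the method works from self-generated data without preference labels or demonstrations.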