Deliberative Alignment: Reasoning Enables Safer Language Models
Format: Article
Language: English
Abstract: As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over those specifications before answering. We used this approach to align OpenAI's o-series models and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chains of thought or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks and decreasing overrefusal rates, and it also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
DOI: 10.48550/arxiv.2412.16339
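To make the paradigm in the abstract concrete, here is a minimal Python sketch of what a spec-conditioned training example might look like: a user prompt paired with a chain of thought that explicitly quotes the relevant clause of a safety specification before the final answer. All names here (SAFETY_SPEC, build_example, the policy text, the data format) are illustrative assumptions for this sketch, not the paper's actual specification or training pipeline.

```python
# Minimal sketch of a deliberative-alignment-style training example.
# Assumption: the real pipeline, spec text, and data format differ; this
# only illustrates pairing a prompt with a chain of thought that cites
# the safety specification before producing the final answer.

SAFETY_SPEC = (
    "Policy P1: Refuse requests for instructions that facilitate harm.\n"
    "Policy P2: For medical questions, answer factually and recommend "
    "consulting a professional."
)

def build_example(user_prompt: str, cited_policy: str,
                  reasoning: str, answer: str) -> dict:
    """Package one supervised example whose chain of thought first recalls
    the relevant policy from the spec, then reasons, then answers."""
    chain_of_thought = (
        f"Relevant specification: {cited_policy}\n"
        f"Reasoning: {reasoning}"
    )
    return {
        "prompt": user_prompt,
        "chain_of_thought": chain_of_thought,  # explicit recall of the spec
        "answer": answer,
    }

example = build_example(
    user_prompt="What painkiller dosage is dangerous?",
    cited_policy="Policy P2",
    reasoning="The question is medical; answer factually and add a note "
              "to consult a professional.",
    answer="Dangerous thresholds vary by drug and by person; do not exceed "
           "label directions, and ask a clinician or poison control first.",
)
print(example["chain_of_thought"])
```

Under this reading, the key property the abstract claims is that the chain of thought citing the specification is generated and verified automatically rather than written by humans, which is what makes the approach scale.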