Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | We present Second Thought, a new learning paradigm that enables language
models (LMs) to re-align with human values. By modeling the chain-of-edits
between value-unaligned and value-aligned text, with LM fine-tuning and
additional refinement through reinforcement learning, Second Thought not only
achieves superior performance in three value alignment benchmark datasets but
also shows strong human-value transfer learning ability in few-shot scenarios.
The generated editing steps also offer better interpretability and ease for
interactive error correction. Extensive human evaluations further confirm its
effectiveness. |
---|---|
DOI: | 10.48550/arxiv.2301.00355 |