Aligning Large Language Models with Human Preferences through Representation Engineering
Format: Article
Language: English
Abstract: Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often involve employing reinforcement learning from human feedback (RLHF) to fine-tune LLMs based on human labels assessing the relative quality of model responses. Nevertheless, RLHF is susceptible to instability during fine-tuning and presents challenges in implementation. Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM, and to achieve precise control of model behavior by transforming its representations. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement. Extensive experiments demonstrate the efficacy of RAHF in not only capturing but also manipulating representations to align with a broad spectrum of human preferences or values, rather than being confined to a singular concept or function (e.g., honesty or bias). RAHF's versatility in accommodating diverse human preferences shows its potential for advancing LLM performance.
DOI: 10.48550/arxiv.2312.15997
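The record above only summarizes the approach; the paper's own training procedure (RAHF) is not reproduced here. As a rough illustration of the underlying representation-engineering idea, reading preference-related directions out of a model's hidden activations and adding them back in to steer behavior, the following minimal PyTorch/Transformers sketch computes a difference-of-means "preference direction" from toy preferred and dispreferred responses and injects it into one decoder layer during generation. The model name, layer index, steering strength, and toy data are assumptions for illustration, not values from the paper.

```python
# Sketch of representation steering (not the paper's RAHF implementation):
# 1) collect hidden-state activations for preferred vs. dispreferred responses,
# 2) take the difference of their means as a "preference direction",
# 3) add a scaled copy of that direction to one decoder layer at generation time.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed model; any decoder-only LM works
LAYER = 15    # decoder layer to read from and steer (assumption)
ALPHA = 4.0   # steering strength (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(texts, layer):
    """Average the hidden state of `layer` over tokens and examples."""
    vecs = []
    with torch.no_grad():
        for t in texts:
            inputs = tokenizer(t, return_tensors="pt")
            out = model(**inputs, output_hidden_states=True)
            # out.hidden_states[layer] has shape (1, seq_len, hidden_dim)
            vecs.append(out.hidden_states[layer].mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Toy preference data; real pipelines use human-labeled response pairs.
preferred    = ["I can't help with that, but here is a safe alternative ..."]
dispreferred = ["Sure, here is how to do something harmful ..."]

direction = mean_activation(preferred, LAYER) - mean_activation(dispreferred, LAYER)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    if isinstance(output, tuple):
        return (output[0] + ALPHA * direction.to(output[0].dtype),) + output[1:]
    return output + ALPHA * direction.to(output.dtype)

# `model.model.layers` assumes a Llama-style architecture; adjust for other models.
handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
try:
    prompt = "How do I pick a lock?"
    ids = tokenizer(prompt, return_tensors="pt")
    generated = model.generate(**ids, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

This inference-time activation patching is only one simple way to operationalize the idea described in the abstract; consult the paper itself for the method it actually proposes and evaluates.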