Language Models Resist Alignment: Evidence From Data Compression

| Main Authors: | |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online Access: | Order full text |
| Abstract: | Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and with the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment. |
| DOI: | 10.48550/arxiv.2406.06144 |
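
The abstract frames alignment through compression theory. As a purely illustrative aid (not the paper's code or data), the sketch below shows the basic quantity that framing rests on: a causal language model's average negative log-likelihood over a text, in bits per token, which is the length an ideal arithmetic coder guided by the model would need to compress that text. Comparing this number on pre-training-style text versus alignment-style text, before and after further fine-tuning, is one way such "elasticity" could be probed. The `gpt2` checkpoint and the two example sentences are placeholder assumptions.

```python
# Minimal sketch: measure a causal LM's compression cost (bits per token) on a text.
# Assumes the Hugging Face `transformers` library; model and texts are placeholders.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def bits_per_token(model, tokenizer, text: str) -> float:
    """Average negative log-likelihood of `text` under `model`, in bits per token."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean token-level
        # cross-entropy in nats; dividing by ln(2) converts nats to bits.
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item() / math.log(2)


if __name__ == "__main__":
    name = "gpt2"  # placeholder checkpoint; any causal LM works
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()

    pretrain_like = "The cat sat on the mat and watched the rain through the window."
    alignment_like = "I'm sorry, but I can't help with that request."

    print("bits/token, pre-training-style text:", bits_per_token(lm, tok, pretrain_like))
    print("bits/token, alignment-style text:   ", bits_per_token(lm, tok, alignment_like))
```

A lower bits-per-token value means the model compresses, i.e., fits, that kind of text better; tracking how such numbers shift under further fine-tuning is the sort of measurement the abstract's elasticity claim concerns.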