From Attention to Activation: Unravelling the Enigmas of Large Language Models
Format: | Article |
Language: | English |
Abstract: | We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that utilises orthogonal matrices to transform gradients, to address this issue. Finally, not only do our methods prevent these phenomena from occurring, but they also enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are unable to do. In summary, our methods reduce the attention proportion on the first token from 65% to 3.3%, the activation kurtosis in the hidden states from 1657 to 3.1, and the perplexity penalty under 4-bit weight quantisation from 3565 to 0.3. |
DOI: | 10.48550/arxiv.2410.17174 |
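The record does not give the softmax-1 formula. A minimal sketch follows, assuming the common reading in which the denominator gains a constant +1 term (an implicit extra logit fixed at 0), so an attention head can let its weights sum to less than 1 rather than being forced to park mass on the first token. The function name `softmax_one` and the numerical-stability details are our own illustration, not the authors' code.

```python
import torch

def softmax_one(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Softmax variant with an extra +1 in the denominator (assumed reading of
    the abstract's softmax-1): weights may sum to less than 1, so a head with
    nothing useful to attend to need not concentrate mass on any token."""
    # Treat the +1 as an implicit extra logit fixed at 0, and apply the usual
    # max-shift trick for numerical stability (the implicit 0 joins the max).
    m = torch.clamp(scores.max(dim=dim, keepdim=True).values, min=0.0)
    exp_scores = (scores - m).exp()
    return exp_scores / ((-m).exp() + exp_scores.sum(dim=dim, keepdim=True))
```

Dropping this in place of the softmax call inside an attention block is the kind of change the abstract describes; the record does not say whether models must be trained from scratch with it.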
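Similarly, the record only states that OrthoAdam "utilises orthogonal matrices to transform gradients". The sketch below takes the simplest reading: keep Adam's moment estimates in a basis rotated by a fixed random orthogonal matrix per parameter tensor, and rotate the resulting step back into parameter space. The class name `OrthoAdamSketch`, the QR-based random matrix, and the dense n-by-n rotation (impractical for large tensors) are illustrative assumptions, not the authors' implementation.

```python
import torch

class OrthoAdamSketch(torch.optim.Optimizer):
    """Hedged sketch: run Adam's moment updates in a basis rotated by a fixed
    random orthogonal matrix Q, then map the step back with Q^T."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, (b1, b2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad.reshape(-1)                 # flatten the gradient
                state = self.state[p]
                if not state:
                    n = g.numel()
                    # Fixed random orthogonal matrix for this parameter tensor
                    # (QR of a Gaussian matrix; dense, illustration only).
                    q, _ = torch.linalg.qr(torch.randn(n, n, device=g.device))
                    state["Q"] = q.to(g.dtype)
                    state["m"] = torch.zeros_like(g)
                    state["v"] = torch.zeros_like(g)
                    state["t"] = 0
                Q, m, v = state["Q"], state["m"], state["v"]
                state["t"] += 1
                t = state["t"]
                g_rot = Q @ g                          # gradient in rotated basis
                m.mul_(b1).add_(g_rot, alpha=1 - b1)   # first-moment estimate
                v.mul_(b2).addcmul_(g_rot, g_rot, value=1 - b2)  # second moment
                m_hat = m / (1 - b1 ** t)              # bias correction
                v_hat = v / (1 - b2 ** t)
                step_rot = m_hat / (v_hat.sqrt() + eps)  # Adam step, rotated basis
                # Rotate the step back and apply it to the parameter.
                p.add_((Q.T @ step_rot).reshape(p.shape), alpha=-lr)
        return loss
```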