Transformers need glasses! Information over-squashing in language tasks
| Main authors: | , , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Order full text |
| Summary: | We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues. |
| DOI: | 10.48550/arxiv.2406.04267 |
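
As a rough illustration of the representational-collapse claim in the summary (a toy sketch, not the paper's actual construction), the snippet below pools the token embeddings of two sequences that differ in a single token, a setup reminiscent of counting tasks. The use of PyTorch, the two-token vocabulary, the 64-dimensional embedding, and the uniform-attention (mean) pooling are all assumptions made for this demonstration.

```python
# Toy sketch of representational collapse (illustrative assumptions only):
# sequence A = one "1" followed by n "0"s, sequence B = n+1 "0"s.
# A uniform-attention readout (mean over token embeddings) stands in for the
# last-token representation; the two readouts differ only by
# (emb(1) - emb(0)) / (n + 1), which shrinks as n grows and is often
# rounded away entirely once the vectors are cast to bfloat16.
import torch

torch.manual_seed(0)
emb = torch.nn.Embedding(2, 64)  # hypothetical 2-token vocabulary, dim 64

with torch.no_grad():
    for n in (10, 1_000, 100_000):
        seq_a = torch.tensor([1] + [0] * n)            # contains exactly one "1"
        seq_b = torch.zeros(n + 1, dtype=torch.long)   # all "0"s
        pooled_a = emb(seq_a).mean(dim=0)              # uniform-attention pooling
        pooled_b = emb(seq_b).mean(dim=0)
        gap_fp32 = (pooled_a - pooled_b).norm().item()
        gap_bf16 = (pooled_a.bfloat16() - pooled_b.bfloat16()).float().norm().item()
        print(f"n={n:>7}  fp32 gap={gap_fp32:.2e}  bf16 gap={gap_bf16:.2e}")
```

In fp32 the gap shrinks roughly like 1/n but never vanishes exactly; once it falls below bfloat16's coarse precision, the two pooled representations can round to the same bits, at which point no downstream readout can respond to the two sequences differently.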