Setting the Record Straight on Transformer Oversmoothing
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Transformer-based models have recently become wildly successful across a
diverse set of domains. At the same time, recent work has shown empirically and
theoretically that Transformers are inherently limited. Specifically, they
argue that as model depth increases, Transformers oversmooth, i.e., inputs
become more and more similar. A natural question is: How can Transformers
achieve these successes given this shortcoming? In this work we test these
observations empirically and theoretically and uncover a number of surprising
findings. We find that there are cases where feature similarity increases but,
contrary to prior results, this is not inevitable, even for existing
pre-trained models. Theoretically, we show that smoothing behavior depends on
the eigenspectrum of the value and projection weights. We verify this
empirically and observe that the sign of layer normalization weights can
influence this effect. Our analysis reveals a simple way to parameterize the
weights of the Transformer update equations to influence smoothing behavior. We
hope that our findings give ML researchers and practitioners additional insight
into how to develop future Transformer-based models. |
DOI: | 10.48550/arxiv.2401.04301 |
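
The oversmoothing notion used in the summary ("inputs become more and more similar" with depth) can be illustrated with a small numerical sketch. This is not the paper's method or experimental setup. It assumes a toy attention-only stack with shared random query/key weights and no value or projection weights, residual connections, or layer normalization, which are exactly the components the summary says can change the behavior. The names `mean_pairwise_cosine` and `attention_only_layer` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def mean_pairwise_cosine(X):
    """Average cosine similarity over all distinct pairs of token rows in X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    n = X.shape[0]
    # Subtract the diagonal (self-similarity) before averaging.
    return (S.sum() - np.trace(S)) / (n * (n - 1))

def softmax(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def attention_only_layer(X, W_q, W_k):
    """Toy layer (assumption): softmax attention applied directly to X.

    Each output row is a convex combination of the input rows, because
    the attention matrix has positive entries and rows that sum to 1.
    """
    d = X.shape[1]
    A = softmax((X @ W_q) @ (X @ W_k).T / np.sqrt(d))
    return A @ X

rng = np.random.default_rng(0)
n_tokens, dim, depth = 8, 16, 12  # illustrative sizes, not from the paper
X = rng.normal(size=(n_tokens, dim))
W_q = rng.normal(size=(dim, dim)) / np.sqrt(dim)
W_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)

# Track feature similarity at the input and after every layer.
similarities = [mean_pairwise_cosine(X)]
for _ in range(depth):
    X = attention_only_layer(X, W_q, W_k)
    similarities.append(mean_pairwise_cosine(X))

for layer, sim in enumerate(similarities):
    print(f"layer {layer:2d}: mean pairwise cosine similarity = {sim:.4f}")
```

In this stripped-down setting the similarity climbs toward 1 because every layer only averages token features. The summary's claim is that once value and projection weights (and layer normalization) enter the update, this growth is not inevitable, so the sketch should be read as the baseline case rather than as a statement about full Transformers.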