What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Saved in:
Main authors:
Format: Article
Language: English
Subject headings:
Online access: Order full text
Abstract:
The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning, to the extent that Transformers are often accompanied by adaptive optimizers, layer normalization, learning rate warmup, and more, in comparison to MLPs/CNNs. The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from other architectures, grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first fully derive the Transformer's Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) in doing so, we highlight the important structural differences from the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer's unique optimization landscape and the challenges it poses.
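As a concrete companion to the abstract, the sketch below forms the loss Hessian of a toy single-head self-attention layer numerically with JAX's automatic differentiation, making the object under study tangible: a block structure with one block per pair of weight matrices, each depending non-linearly on the data and the weights. This is an illustrative sketch only, not the paper's analytical derivation; the tiny dimensions, the squared-error loss, and the parameter names W_q, W_k, W_v are assumptions made for this example.

```python
# Illustrative sketch (not from the paper): numerically forming the loss
# Hessian of a single-head self-attention layer via automatic differentiation.
# Toy sizes, squared-error loss, and parameter names are assumptions.
import jax
import jax.numpy as jnp

def self_attention(params, X):
    """Single-head self-attention: softmax(X W_q (X W_k)^T / sqrt(d)) X W_v."""
    Wq, Wk, Wv = params["W_q"], params["W_k"], params["W_v"]
    d = Wq.shape[1]
    scores = (X @ Wq) @ (X @ Wk).T / jnp.sqrt(d)
    A = jax.nn.softmax(scores, axis=-1)   # attention weights (moments of A matter)
    return A @ (X @ Wv)

def loss(params, X, Y):
    """Squared-error loss of the attention output against a target Y."""
    return 0.5 * jnp.sum((self_attention(params, X) - Y) ** 2)

n, d = 4, 3                               # tiny sequence length / width
X = jax.random.normal(jax.random.PRNGKey(0), (n, d))
Y = jax.random.normal(jax.random.PRNGKey(1), (n, d))
params = {
    "W_q": jax.random.normal(jax.random.PRNGKey(2), (d, d)),
    "W_k": jax.random.normal(jax.random.PRNGKey(3), (d, d)),
    "W_v": jax.random.normal(jax.random.PRNGKey(4), (d, d)),
}

# jax.hessian returns a nested pytree: one block per (parameter, parameter)
# pair, e.g. H["W_q"]["W_k"] holds the mixed second derivatives. Inspecting
# these blocks empirically hints at the heterogeneous, data- and
# weight-dependent structure the paper characterizes analytically.
H = jax.hessian(loss)(params, X, Y)
print({(a, b): H[a][b].shape for a in H for b in H[a]})
```

Swapping the attention layer for a single linear (MLP) layer in the same harness is one simple way to see, empirically, how much more structured and parameter-dependent the attention Hessian blocks are by comparison.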
DOI: 10.48550/arxiv.2410.10986