A Mathematical Theory of Attention
Format: Article
Language: English
Abstract: Attention is a powerful component of modern neural networks across a wide variety of domains. However, despite its ubiquity in machine learning, there is a gap in our understanding of attention from a theoretical point of view. We propose a framework to fill this gap by building a mathematically equivalent model of attention using measure theory. With this model, we are able to interpret self-attention as a system of self-interacting particles, we shed light on self-attention from a maximum entropy perspective, and we show that attention is actually Lipschitz-continuous (with an appropriate metric) under suitable assumptions. We then apply these insights to the problem of mis-specified input data; infinitely-deep, weight-sharing self-attention networks; and more general Lipschitz estimates for a specific type of attention studied in concurrent work.
DOI: 10.48550/arxiv.2007.02876
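
As an illustrative aside on the particle interpretation mentioned in the abstract: in standard scaled dot-product self-attention, each token can be read as a particle whose update is an expectation of the other particles' values under a softmax (Gibbs) measure, which is where a maximum-entropy reading enters. The sketch below is a minimal NumPy rendering of this view, assuming single-head, weight-tied attention; the names self_attention, Wq, Wk, Wv are illustrative and are not the paper's notation.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        """Scaled dot-product self-attention as an interacting particle
        system: each row of X is a particle, and each output particle is
        the expectation of the value map under a softmax (Gibbs) measure
        over all particles. Illustrative sketch only."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                  # pairwise interaction energies
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=-1, keepdims=True)             # each row is a probability measure
        return P @ V                                    # per-particle expectation of values

    # Usage: n = 5 particles in dimension d = 4, with shared (weight-tied) maps.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 4))
    Wq, Wk, Wv = (rng.standard_normal((4, 4)) for _ in range(3))
    Y = self_attention(X, Wq, Wk, Wv)
    print(Y.shape)  # (5, 4)

Iterating this map with the same weight matrices gives a discrete analogue of the infinitely-deep, weight-sharing self-attention networks the abstract refers to.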