Uncovering mesa-optimization algorithms in Transformers
Saved in:

Main authors: | |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | Some autoregressive models exhibit in-context learning capabilities: they are able to learn from an input sequence as it is processed, without undergoing any parameter changes and without being explicitly trained to do so. The origins of this phenomenon are still poorly understood. Here we analyze a series of Transformer models trained to perform synthetic sequence prediction tasks, and discover that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed. We show that this process corresponds to gradient-based optimization of a principled objective function, which leads to strong generalization performance on unseen sequences. Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers. |
DOI: | 10.48550/arxiv.2309.05858 |
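
The abstract's claim that next-token training gives rise to "gradient-based optimization of a principled objective function" inside the forward pass can be illustrated with a small numerical sketch. The example below shows the well-known correspondence between a single linear (softmax-free) self-attention head and one gradient-descent step on an in-context least-squares objective; it is not the paper's exact construction, and all variable names (`W_true`, `eta`, etc.) are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's architecture): one linear
# self-attention head reproduces one gradient-descent step on an
# in-context least-squares objective.
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, n_ctx = 4, 2, 16
eta = 0.1  # learning rate of the implicit gradient step (illustrative)

# In-context regression data: targets generated by an unseen linear map W_true.
W_true = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(n_ctx, d_in))        # context inputs x_1..x_n
Y = X @ W_true.T                          # context targets y_1..y_n
x_q = rng.normal(size=(d_in,))            # query token

# (1) Explicit "mesa-optimization" view: one gradient step on
#     L(W) = 0.5 * sum_i ||W x_i - y_i||^2, starting from W_0 = 0.
grad = -(Y.T @ X)                         # dL/dW evaluated at W_0 = 0
W_1 = np.zeros((d_out, d_in)) - eta * grad
pred_gd = W_1 @ x_q

# (2) Linear-attention view: with values y_i, keys x_i, query x_q and no
#     softmax, the head outputs eta * sum_i y_i (x_i . x_q) -- the same
#     prediction as the explicit gradient step above.
attn_scores = X @ x_q                     # unnormalized key-query dot products
pred_attn = eta * (Y.T @ attn_scores)

assert np.allclose(pred_gd, pred_attn)
print("one GD step == linear attention output:", pred_gd)
```

Running the sketch confirms the two views coincide exactly for this toy regression task, which is the flavor of mechanism the abstract refers to when it connects autoregressive loss minimization to an internal, gradient-based learning algorithm.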