How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad
Saved in:
Format: Article
Language: English
Online access: Order full text
Abstract: Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers. This measure shows a contrast with the expressivity results of Transformers captured by the $TC^0/TC^1$ classes (further studied here), since the globality relates to correlations with the more limited $NC^0$ class. We show here, experimentally and theoretically under additional assumptions, that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality barrier with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of 'inductive scratchpad', which composes the prior information more efficiently, can both break the globality barrier and improve OOD generalization. In particular, some of our inductive scratchpads achieve length generalization of up to $6\times$ for some arithmetic tasks, depending on the input formatting.
DOI: 10.48550/arxiv.2406.06467