ENTP: Encoder-only Next Token Prediction
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Abstract: | Next-token prediction is conventionally done using decoder-only Transformers
with causal attention, as this approach allows for efficient reuse of keys and
values. If we were not compute-limited, should we still use decoder-only
Transformers? In this work, we introduce Encoder-only Next Token Prediction
(ENTP). We use small scale experiments to explore the differences between ENTP
and decoders, highlighting potential advantages of ENTP in settings with
unbounded compute. We introduce the Count3 task and show, both theoretically
and experimentally, that while ENTP can perform this task easily, a
decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP's
superior performance across various synthetic tasks, such as length
generalization and in-context learning. |
---|---|
DOI: | 10.48550/arxiv.2410.01600 |
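
The abstract's remark that causal attention "allows for efficient reuse of keys and values" can be illustrated with a small sketch. It is not taken from the paper: the function names, the toy vectors, and the use of identity projections (query = key = value = input) are all assumptions made for illustration. The sketch checks that, under a causal mask, the outputs for an existing prefix do not change when a token is appended, so they could be cached, whereas under full (encoder-style) attention they do change and would have to be recomputed for every new token.

```python
import math


def attention(xs, causal):
    """Single-head self-attention with identity projections (q = k = v = x).

    Hypothetical toy helper, not the paper's model.
    """
    n, d = len(xs), len(xs[0])
    out = []
    for i in range(n):
        # A causal mask lets position i see positions 0..i; otherwise it sees all n.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(xs[i], xs[j])) / math.sqrt(d)
            for j in visible
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * xs[j][k] for w, j in zip(weights, visible)) / total
            for k in range(d)
        ])
    return out


def close(u, v, tol=1e-9):
    """Element-wise comparison of two vectors within a tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))


# Toy inputs (assumed values): a 3-token prefix, then the same prefix plus one token.
prefix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extended = prefix + [[2.0, -1.0]]

# Compare the outputs at the three prefix positions before and after appending.
causal_stable = all(
    close(a, b) for a, b in zip(attention(prefix, True), attention(extended, True))
)
full_stable = all(
    close(a, b) for a, b in zip(attention(prefix, False), attention(extended, False))
)

print(causal_stable, full_stable)  # True False
```

In this toy setting the causal outputs for the prefix are identical before and after the extra token, which is what makes caching possible in a decoder. The full-attention outputs differ, which is the extra compute that the abstract's "not compute-limited" premise sets aside.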