Block-State Transformers
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | State space models (SSMs) have shown impressive results on tasks that require
modeling long-range dependencies, and they scale efficiently to long sequences owing
to their subquadratic runtime complexity. Originally designed for continuous
signals, SSMs have shown superior performance on a plethora of tasks in vision
and audio; however, SSMs still lag behind Transformers in language modeling
tasks. In this work, we propose a hybrid layer named Block-State Transformer
(BST) that internally combines an SSM sublayer for long-range
contextualization and a Block Transformer sublayer for short-term
representation of sequences. We study three different, fully
parallelizable variants that integrate SSMs and block-wise attention. We show
that our model outperforms similar Transformer-based architectures on language
modeling perplexity and generalizes to longer sequences. In addition, the
Block-State Transformer demonstrates a more than tenfold increase in speed at the
layer level compared to the Block-Recurrent Transformer when model
parallelization is employed. |
---|---|
DOI: | 10.48550/arxiv.2306.09539 |