DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Scaling multi-dimensional transformers to long sequences is indispensable
across various domains. However, the large memory requirements and slow
processing speeds of such sequences necessitate sequence parallelism. All
existing approaches fall under the category of embedded sequence parallelism,
which is limited to sharding along a single sequence dimension and thereby
introduces significant communication overhead. Yet multi-dimensional
transformers by nature perform independent calculations across
multiple sequence dimensions. To this end, we propose Dynamic Sequence
Parallelism (DSP) as a novel abstraction of sequence parallelism. DSP
dynamically switches the parallel dimension among all sequence dimensions
according to the computation stage, using an efficient resharding strategy.
DSP offers significant reductions in communication costs, adaptability across
modules, and ease of implementation with minimal constraints. Experimental
evaluations demonstrate DSP's superiority over state-of-the-art embedded
sequence parallelism methods, with throughput improvements ranging from 32.2%
to 10x and less than 25% of the communication volume. |
---|---|
DOI: | 10.48550/arxiv.2403.10266 |
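
The idea in the summary, switching the sharded sequence dimension between computation stages, can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the paper's implementation: the "devices" are entries in a Python list, the resharding step is assumed to be an all-to-all style exchange, and the names `shard`, `all_to_all_reshard`, `center_along`, `run_sharded`, and `run_reference` are hypothetical. The toy operation (mean-centering along one dimension) stands in for any stage that is independent across the other sequence dimension.

```python
import numpy as np


def shard(x, num_devices, dim):
    """Split a full tensor into equal per-device shards along `dim`."""
    return np.split(x, num_devices, axis=dim)


def all_to_all_reshard(shards, old_dim, new_dim):
    """Simulate moving the sharded dimension from `old_dim` to `new_dim`.

    Assumption: the exchange behaves like an all-to-all collective.
    """
    n = len(shards)
    # Each device cuts its local shard into n pieces along the new dimension.
    pieces = [np.split(s, n, axis=new_dim) for s in shards]
    # Device j collects piece j from every device and joins them along the
    # old dimension, so it now holds the full extent of `old_dim`.
    return [
        np.concatenate([pieces[i][j] for i in range(n)], axis=old_dim)
        for j in range(n)
    ]


def center_along(x, axis):
    """Toy stage: subtract the mean along one sequence dimension."""
    return x - x.mean(axis=axis, keepdims=True)


def run_reference(x):
    """Unsharded computation: stage over dim 0, then stage over dim 1."""
    return center_along(center_along(x, 0), 1)


def run_sharded(x, num_devices):
    """Same computation with the parallel dimension switched between stages."""
    # Stage 1 works along dim 0, so shard along dim 1 (no communication needed).
    shards = shard(x, num_devices, dim=1)
    shards = [center_along(s, 0) for s in shards]
    # Stage 2 works along dim 1, so reshard to be split along dim 0 instead.
    shards = all_to_all_reshard(shards, old_dim=1, new_dim=0)
    shards = [center_along(s, 1) for s in shards]
    # Gather only to compare against the reference result.
    return np.concatenate(shards, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 6, 8))  # (dim 0, dim 1, hidden)
    assert np.allclose(run_sharded(x, 2), run_reference(x))
```

In this sketch each stage runs entirely on local data, and communication happens only once, at the point where the computation moves from one sequence dimension to the other.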