SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling
Saved in:
Main Authors: |  |
Format: | Article |
Language: | English |
Subjects: |  |
Online Access: | Order full text |
Abstract: | Benefiting from the self-attention mechanism, Transformer models have
attained impressive contextual comprehension capabilities for lengthy texts. As
large language models (LLMs) become increasingly prevalent, the demand for
high-throughput inference grows, which calls for large-scale token parallel
processing (LTPP). However, existing dynamic sparse accelerators struggle to
handle LTPP effectively, as they focus on optimizing each stage in isolation,
with most efforts confined to computational enhancements. By re-examining the
end-to-end flow of dynamic sparse acceleration, we pinpoint a long-overlooked
opportunity: under LTPP, the intrinsic coordination among stages can be
exploited to avoid excessive memory access and redundant computation. Motivated
by this observation, we present SOFA, a cross-stage compute-memory efficient
algorithm-hardware co-design tailored to the challenges that LTPP poses for
Transformer inference. We first propose a novel leading-zero computing paradigm,
which predicts attention sparsity using log-based add-only operations to avoid
the significant overhead of prediction. Then, a distributed sorting scheme and a
sorted-updating FlashAttention mechanism are proposed under a cross-stage
coordinated tiling principle, which enables fine-grained and lightweight
coordination among stages and thereby optimizes memory access and latency.
Further, we propose a SOFA accelerator to support these optimizations
efficiently. Extensive experiments on 20 benchmarks show that SOFA achieves a
$9.5\times$ speedup and $71.5\times$ higher energy efficiency than an Nvidia
A100 GPU. Compared to 8 state-of-the-art accelerators, SOFA achieves, on
average, $15.8\times$ higher energy efficiency, $10.3\times$ higher area
efficiency, and a $9.3\times$ speedup. |
DOI: | 10.48550/arxiv.2407.10416 |
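The abstract names a leading-zero paradigm that predicts attention sparsity with log-based add-only operations, but gives no details. As a minimal NumPy sketch of the general idea (not SOFA's actual prediction scheme), the following code replaces each partial product of a Q·K dot product with a sign and a power of two obtained by adding integer floor-log2 exponents, which is what a leading-zero-count (CLZ) unit provides in hardware. The function names `ilog2`, `predict_scores`, `sparsity_mask` and the top-k `keep_ratio` selection are illustrative assumptions.

```python
import numpy as np

def ilog2(x):
    """floor(log2|x|) per element; in hardware this is (word size - 1) - CLZ(|x|)."""
    mag = np.abs(x).astype(np.int64)
    out = np.full(mag.shape, -63, dtype=np.int64)           # sentinel for zero entries
    nz = mag > 0
    out[nz] = np.floor(np.log2(mag[nz])).astype(np.int64)   # software stand-in for a CLZ unit
    return out

def predict_scores(Q, K):
    """Multiplier-free estimate of Q @ K.T: each partial product q*k is replaced by
    sign(q*k) * 2^(ilog2(q) + ilog2(k)), i.e. an exponent addition plus a shift,
    and the shifted terms are accumulated with plain additions."""
    lq, lk = ilog2(Q), ilog2(K)                  # (n, d) and (m, d) exponents
    sq, sk = np.sign(Q), np.sign(K)
    exp = lq[:, None, :] + lk[None, :, :]        # add-only exponent combination
    sgn = sq[:, None, :] * sk[None, :, :]        # an XOR of sign bits in hardware
    terms = sgn * np.where(exp >= 0, np.int64(1) << np.maximum(exp, 0), 0)
    return terms.sum(axis=-1)                    # estimated (n, m) score matrix

def sparsity_mask(Q, K, keep_ratio=0.25):
    """Keep, per query row, only the keys whose estimated score is in the top
    keep_ratio fraction; exact attention would then be computed only where True."""
    est = predict_scores(Q, K)
    k = max(1, int(keep_ratio * K.shape[0]))
    thresh = np.partition(est, -k, axis=1)[:, -k][:, None]
    return est >= thresh

# toy usage with 8-bit quantized queries and keys
rng = np.random.default_rng(0)
Q = rng.integers(-128, 128, size=(4, 16))
K = rng.integers(-128, 128, size=(8, 16))
print(sparsity_mask(Q, K, keep_ratio=0.25).astype(int))
```

The floor-log2 approximation underestimates each product by at most a factor of four, which is tolerable here because the estimates are only ranked to pick candidates, never used as final attention scores.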
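The sorted-updating FlashAttention mechanism is likewise only named in the abstract. The sketch below shows the generic idea it builds on: a FlashAttention-style online-softmax accumulation over key/value tiles, restricted to tiles selected by a sparsity predictor and visited in descending order of predicted score, so the running maximum settles early and the accumulator rarely needs rescaling. The function `tiled_sparse_attention`, the single-query formulation, and the `tile_ids` ordering convention are assumptions for illustration; SOFA's distributed sorting and cross-stage coordinated tiling are not reproduced here.

```python
import numpy as np

def tiled_sparse_attention(q, K, V, tile_ids, tile=4):
    """Online-softmax attention for one query over a subset of key/value tiles.
    tile_ids is assumed to come from the sparsity predictor, pre-sorted by
    descending estimated score, so the running max m is established early and
    the rescaling factor below is usually exactly 1."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of logits seen so far
    l = 0.0                                       # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted sum of values
    for t in tile_ids:
        Kt = K[t * tile:(t + 1) * tile]
        Vt = V[t * tile:(t + 1) * tile]
        s = Kt @ q / np.sqrt(d)                   # logits for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # == 1 whenever the max is unchanged
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vt
        m = m_new
    return acc / l

# toy usage: 16 keys in tiles of 4, attending only to tiles 2 and 0 (predicted important)
rng = np.random.default_rng(1)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
print(tiled_sparse_attention(q, K, V, tile_ids=[2, 0], tile=4).round(3))
```

Skipping unselected tiles saves both the Q·K computation and the corresponding K/V memory traffic, which is the cross-stage compute-memory benefit the abstract attributes to coordinating the prediction and attention stages.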