HSR-Enhanced Sparse Attention Acceleration
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across
various applications, but their performance on long-context tasks is often
limited by the computational complexity of attention mechanisms. This paper
introduces a novel approach to accelerate attention computation in LLMs,
particularly for long-context scenarios. We leverage the inherent sparsity
within attention mechanisms, both in conventional Softmax attention and ReLU
attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$),
to significantly reduce the running time complexity. Our method employs a
Half-Space Reporting (HSR) data structure to rapidly identify non-zero or
"massively activated" entries in the attention matrix. We present theoretical
analyses for two key scenarios: attention generation and full attention
computation with long input context. For attention generation, our approach
achieves a running time of $O(mn^{4/5})$, significantly faster than the naive
$O(mn)$, where $n$ is the context length, $m$ is the query length, and $d$ is
the hidden dimension. We can also reduce the running time of full
attention computation from $O(mn)$ to $O(mn^{1 - 1 / \lfloor d/2\rfloor} +
mn^{4/5})$. Importantly, our method introduces no error for ReLU attention and
only provably negligible error for Softmax attention; the latter claim is
supported by our empirical validation. This work represents a significant step
towards enabling efficient long-context processing in LLMs, potentially
broadening their applicability across various domains.
DOI: 10.48550/arxiv.2410.10165
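
The sparsity idea summarized in the abstract lends itself to a small illustration. The sketch below is a minimal example, not the paper's implementation: it assumes a row-normalized $\mathsf{ReLU}^\alpha$ attention of the form $D^{-1}\,\mathsf{ReLU}(QK^\top)^\alpha V$ (this exact form is an assumption made here for illustration). A key $k_j$ contributes to query $q_i$ only when $\langle q_i, k_j \rangle > 0$, which is a half-space membership condition, so a Half-Space Reporting structure can report exactly the contributing keys. Here a brute-force scan stands in for the HSR query, so the sketch demonstrates only that the sparse computation introduces no error, not the sublinear running time.

```python
import numpy as np

def relu_attention_dense(Q, K, V, alpha=2):
    """Dense ReLU^alpha attention: row-normalized ReLU(Q K^T)^alpha applied to V."""
    S = np.maximum(Q @ K.T, 0.0) ** alpha        # (m, n) non-negative scores
    D = S.sum(axis=1, keepdims=True) + 1e-12     # row normalizer (epsilon avoids 0/0)
    return (S / D) @ V

def relu_attention_sparse(Q, K, V, alpha=2):
    """Same result, but each query only touches its activated keys.

    The `scores > 0.0` scan stands in for an HSR query that would report
    the keys lying in the half-space {k : <q_i, k> > 0} without scanning all n keys.
    """
    out = np.zeros((Q.shape[0], V.shape[1]))
    for i, q in enumerate(Q):
        scores = K @ q                           # inner products <q_i, k_j>
        active = np.nonzero(scores > 0.0)[0]     # indices of activated keys
        if active.size == 0:
            continue                             # no key activated: output row stays zero
        w = scores[active] ** alpha              # only activated keys contribute
        out[i] = (w / (w.sum() + 1e-12)) @ V[active]
    return out

rng = np.random.default_rng(0)
m, n, d = 4, 64, 8
Q, K = rng.standard_normal((m, d)), rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
assert np.allclose(relu_attention_dense(Q, K, V), relu_attention_sparse(Q, K, V))
```

In the paper's setting, the linear scan would be replaced by an actual half-space reporting structure, whose query cost is what gives rise to the $n^{1 - 1/\lfloor d/2 \rfloor}$ term in the full-attention bound quoted above; for Softmax attention, the same reporting step would be used to pick out only the massively activated entries, which is where the provably negligible approximation error comes in.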