HSR-Enhanced Sparse Attention Acceleration
Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. This paper introduces a novel approach to accelerate attention computation in LLMs...
Main authors: | Chen, Bo ; Liang, Yingyu ; Sha, Zhizhou ; Shi, Zhenmei ; Song, Zhao |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Chen, Bo ; Liang, Yingyu ; Sha, Zhizhou ; Shi, Zhenmei ; Song, Zhao |
description | Large Language Models (LLMs) have demonstrated remarkable capabilities across
various applications, but their performance on long-context tasks is often
limited by the computational complexity of attention mechanisms. This paper
introduces a novel approach to accelerate attention computation in LLMs,
particularly for long-context scenarios. We leverage the inherent sparsity
within attention mechanisms, both in conventional Softmax attention and ReLU
attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$),
to significantly reduce the running time complexity. Our method employs a
Half-Space Reporting (HSR) data structure to rapidly identify non-zero or
"massively activated" entries in the attention matrix. We present theoretical
analyses for two key scenarios: attention generation and full attention
computation with long input context. Our approach achieves a running time of
$O(mn^{4/5})$, significantly faster than the naive $O(mn)$ approach, for
attention generation, where $n$ is the context length, $m$ is the query length,
and $d$ is the hidden dimension. We can also reduce the running time of full
attention computation from $O(mn)$ to $O(mn^{1 - 1 / \lfloor d/2\rfloor} +
mn^{4/5})$. Importantly, our method introduces no error for ReLU attention and
only provably negligible error for Softmax attention, where the latter is
supported by our empirical validation. This work represents a significant step
towards enabling efficient long-context processing in LLMs, potentially
broadening their applicability across various domains. |
doi_str_mv | 10.48550/arxiv.2410.10165 |
format | Article |
date | 2024-10-14 |
rights | http://creativecommons.org/licenses/by/4.0 |
oa | free_for_read |
linktorsrc | https://arxiv.org/abs/2410.10165 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2410.10165 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2410_10165 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence ; Computer Science - Computation and Language ; Computer Science - Learning |
title | HSR-Enhanced Sparse Attention Acceleration |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T09%3A07%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=HSR-Enhanced%20Sparse%20Attention%20Acceleration&rft.au=Chen,%20Bo&rft.date=2024-10-14&rft_id=info:doi/10.48550/arxiv.2410.10165&rft_dat=%3Carxiv_GOX%3E2410_10165%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
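
The abstract states that restricting the computation to the non-zero entries of the attention matrix introduces no error for ReLU attention. The following is a minimal sketch of why that can hold, not the authors' implementation. It assumes the attention weights take the form $\mathsf{ReLU}^\alpha(\langle q, k_i \rangle / \sqrt{d} - b)$ with a bias threshold `b`, normalized per query; that form, the parameter `b`, and all function names are illustrative assumptions. The hypothetical helper `report_half_space` is a brute-force stand-in for a Half-Space Reporting query, so it shows the interface only and has none of the sublinear running time claimed in the abstract.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def report_half_space(keys, q, threshold):
    """Brute-force stand-in for an HSR query (hypothetical helper).

    Returns the indices i with <keys[i], q> > threshold. A real HSR data
    structure would answer this without scanning every key.
    """
    return [i for i, k in enumerate(keys) if dot(k, q) > threshold]


def relu_attention_dense(Q, K, V, b=0.0, alpha=1):
    """Naive ReLU^alpha attention: every query touches every key."""
    d = len(Q[0])
    out = []
    for q in Q:
        w = [max(dot(k, q) / math.sqrt(d) - b, 0.0) ** alpha for k in K]
        total = sum(w)
        row = [0.0] * len(V[0])
        if total > 0:  # assumption: a query with no activated key outputs zeros
            for wi, v in zip(w, V):
                for c, x in enumerate(v):
                    row[c] += wi * x / total
        out.append(row)
    return out


def relu_attention_sparse(Q, K, V, b=0.0, alpha=1):
    """Same quantity, computed only over the keys reported as activated."""
    d = len(Q[0])
    out = []
    for q in Q:
        # <k, q> / sqrt(d) - b > 0  is equivalent to  <k, q> > b * sqrt(d)
        idx = report_half_space(K, q, b * math.sqrt(d))
        w = [max(dot(K[i], q) / math.sqrt(d) - b, 0.0) ** alpha for i in idx]
        total = sum(w)
        row = [0.0] * len(V[0])
        if total > 0:
            for wi, i in zip(w, idx):
                for c, x in enumerate(V[i]):
                    row[c] += wi * x / total
        out.append(row)
    return out


def random_matrix(rows, cols, rng):
    """Matrix of independent standard normal entries (toy data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
Q = random_matrix(3, 8, rng)
K = random_matrix(200, 8, rng)
V = random_matrix(200, 8, rng)

dense = relu_attention_dense(Q, K, V, b=0.5, alpha=2)
sparse = relu_attention_sparse(Q, K, V, b=0.5, alpha=2)
max_diff = max(abs(x - y) for r, s in zip(dense, sparse) for x, y in zip(r, s))
assert max_diff < 1e-9  # the two results agree up to floating-point rounding
```

Keys outside the reported half-space have weight exactly zero under ReLU, so skipping them changes neither the numerator nor the normalizer. For Softmax attention every weight is strictly positive, so the same shortcut can only be approximate, which matches the abstract's distinction between the two cases.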