FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs
Format: Article
Language: English
Abstract: The FlashAttention series has been widely applied in the inference of large language models (LLMs). However, the FlashAttention series only supports high-end GPU architectures, e.g., Ampere and Hopper. At present, the FlashAttention series is not easily transferrable to NPUs and low-resource GPUs. Moreover, the FlashAttention series is inefficient for multi-NPU or multi-GPU inference scenarios. In this work, we propose FastAttention, which pioneers the adaptation of the FlashAttention series for NPUs and low-resource GPUs to boost LLM inference efficiency. Specifically, we take Ascend NPUs and Volta-based GPUs as representatives for designing our FastAttention. We migrate the FlashAttention series to Ascend NPUs by proposing a novel two-level tiling strategy for runtime speedup, a tiling-mask strategy for memory saving, and a tiling-AllReduce strategy for reducing communication overhead. Besides, we adapt FlashAttention for Volta-based GPUs by redesigning the operand layout in shared memory and introducing a simple yet effective CPU-GPU cooperative strategy for efficient memory utilization. On Ascend NPUs, our FastAttention achieves a 10.7× speedup over the standard attention implementation. Llama-7B with FastAttention reaches up to 5.16× higher throughput than with standard attention. On Volta-architecture GPUs, FastAttention yields a 1.43× speedup over its equivalents in xformers. Pangu-38B with FastAttention delivers a 1.46× end-to-end speedup using FasterTransformer. Coupled with the proposed CPU-GPU cooperative strategy, FastAttention supports a maximal input length of 256K on 8 V100 GPUs. All the code will be made available soon.
DOI: 10.48550/arxiv.2410.16663
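For readers unfamiliar with the tiling idea the abstract refers to, the following is a minimal, CPU-only sketch of FlashAttention-style tiled attention with an online softmax, the technique that FastAttention builds on. It is illustrative only: the function name, block sizes, and single-head NumPy layout are assumptions for exposition, and it does not reproduce the paper's NPU/GPU kernels or its two-level tiling, tiling-mask, or tiling-AllReduce strategies.

```python
# Minimal sketch of FlashAttention-style tiled attention with online softmax.
# Hypothetical names and block sizes; not the FastAttention kernel itself.
import numpy as np

def tiled_attention(q, k, v, block_q=64, block_k=64):
    """q, k, v: (seq_len, head_dim) arrays for a single attention head."""
    seq_len, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)
    out = np.zeros_like(q)

    for qs in range(0, seq_len, block_q):
        q_blk = q[qs:qs + block_q] * scale
        # Running statistics for the online softmax of this query block.
        row_max = np.full(q_blk.shape[0], -np.inf)
        row_sum = np.zeros(q_blk.shape[0])
        acc = np.zeros((q_blk.shape[0], head_dim))

        for ks in range(0, seq_len, block_k):
            scores = q_blk @ k[ks:ks + block_k].T          # one (bq, bk) tile
            new_max = np.maximum(row_max, scores.max(axis=1))
            # Rescale previously accumulated results to the new running maximum.
            correction = np.exp(row_max - new_max)
            p = np.exp(scores - new_max[:, None])
            row_sum = row_sum * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v[ks:ks + block_k]
            row_max = new_max

        out[qs:qs + block_q] = acc / row_sum[:, None]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
    # Reference: standard (non-tiled) attention, for a correctness check.
    s = (q / np.sqrt(64)) @ k.T
    ref = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (ref / ref.sum(axis=1, keepdims=True)) @ v
    print(np.allclose(tiled_attention(q, k, v), ref, atol=1e-6))
```

The point of the tiling is that each (block_q, block_k) score tile is consumed immediately and never materialized as a full seq_len × seq_len matrix, which is what makes the approach attractive on memory-constrained NPUs and low-resource GPUs.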