Mobile Transformer Accelerator Exploiting Various Line Sparsity and Tile-Based Dynamic Quantization
Saved in:
Published in: | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2024-06, Vol.43 (6), p.1808-1821 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
Abstract: | Transformer models are difficult to employ in mobile devices due to their memory- and computation-intensive properties. Accordingly, there is ongoing research on various methods for compressing transformer models, such as pruning and quantization. However, general computing platforms, such as central processing units (CPUs) and graphics processing units (GPUs), are not energy-efficient when accelerating pruned models, because the unstructured sparsity such models exhibit degrades parallelism. In this article, we propose a low-power accelerator for transformers that can handle various levels of structured sparsity induced by line pruning with different granularity. Our approach accelerates pruned transformers in a head-wise and line-wise manner. We present a head reorganization and shuffling method that supports head-wise skip operations and resolves the load-imbalance problem among processing engines (PEs) caused by the varying number of operations in each head. Furthermore, we implemented a sparse quantized general matrix-to-matrix multiplication (SQ-GEMM) module that supports line-wise skipping and on-the-fly tile-based dynamic quantization of activations. As a result, compared to a mobile GPU and CPU, the proposed accelerator improved energy efficiency by 2.9× and 12.3× for the detection transformer (DETR), and by 3.0× and 12.4× for the vision transformer (ViT) models, respectively. In addition, our proposed mobile accelerator achieved the highest energy efficiency among current state-of-the-art FPGA-based transformer accelerators. |
---|---|
ISSN: | 0278-0070 1937-4151 |
DOI: | 10.1109/TCAD.2023.3347291 |
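The abstract's "tile-based dynamic quantization of activations" can be illustrated with a minimal sketch: activations are split into tiles, and each tile is quantized on the fly with its own scale, so an outlier in one tile does not widen the quantization range of the others. The tile size (16), the int8 target, and the function names below are illustrative assumptions, not details taken from the paper's hardware design.

```python
import numpy as np

def quantize_tiles(act: np.ndarray, tile: int = 16):
    """Quantize each (tile x tile) block of `act` to int8 with a per-tile scale.

    Returns the int8 matrix and a dict mapping each tile's top-left
    corner (row, col) to the dynamic scale computed for that tile.
    """
    rows, cols = act.shape
    q = np.zeros_like(act, dtype=np.int8)
    scales = {}
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            block = act[r:r + tile, c:c + tile]
            # Dynamic (per-tile) scale: map the tile's max magnitude to 127.
            s = float(np.abs(block).max()) / 127.0
            if s == 0.0:          # all-zero tile: any scale works
                s = 1.0
            q[r:r + tile, c:c + tile] = np.round(block / s).astype(np.int8)
            scales[(r, c)] = s
    return q, scales

def dequantize_tiles(q: np.ndarray, scales: dict, tile: int = 16) -> np.ndarray:
    """Recover float activations by rescaling each tile with its own scale."""
    out = q.astype(np.float32)
    for (r, c), s in scales.items():
        out[r:r + tile, c:c + tile] *= s
    return out
```

Because each scale is derived from only its own tile, the worst-case rounding error per element is half that tile's scale, which is what makes per-tile scales attractive for activations whose dynamic range varies across the matrix.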