Improving Transformer Inference Through Optimized Non-Linear Operations With Quantization-Approximation-Based Strategy


Detailed Description

Bibliographic Details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024-10, p. 1-1
Main authors: Wang, Wenxun; Sun, Wenyu; Liu, Yongpan
Format: Article
Language: English
Description
Summary: Transformers have recently shown strong performance across tasks such as natural language processing (NLP) and computer vision (CV). However, this performance comes at the cost of large memory and computation overhead. Existing research primarily focuses on accelerating matrix multiplication (MatMul) through techniques such as quantization and pruning, which notably increases the proportion of inference runtime spent on non-linear operations. Meanwhile, previous approaches designed for non-linear operations suffer from inefficient implementations, as they cannot achieve both computation and memory efficiency. Additionally, these methods often require re-training or fine-tuning, leading to substantial cost and inconvenience. To overcome these problems, we propose an efficient implementation of non-linear operations based on a quantization-approximation strategy. Through an in-depth analysis of the dataflow and data distribution of non-linear operations, we design distinct quantization and approximation strategies tailored to different operations. Specifically, log2 quantization and PTF quantization are employed in Softmax and LayerNorm, complemented by a logarithmic function and low-precision statistic calculation as approximation strategies. Furthermore, the proposed efficient GeLU implementation integrates a non-uniform lookup procedure with low bit-width quantization. Experimental results demonstrate negligible accuracy drops without the need for retraining or fine-tuning. Implemented in hardware, the design achieves 3.14×-6.34× energy-efficiency and 3.01×-10.1× area-efficiency improvements over state-of-the-art ASIC designs. In system-level evaluation, substantial speedups and energy-consumption reductions of 15% to 35% are achieved for end-to-end inference on both GPU and ASIC accelerator platforms.
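The abstract does not specify the paper's exact quantization parameters, but the general idea of log2 quantization for Softmax can be illustrated with a minimal sketch: Softmax outputs lie in (0, 1], so each probability can be stored as a small non-negative exponent k with p ≈ 2^(-k), which turns later multiplications by p into bit-shifts. The bit width and clipping choices below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def log2_quantize(p, bits=4):
    # Store each probability p in (0, 1] as the nearest exponent k,
    # so that p is approximated by 2**(-k). With `bits` bits per value,
    # k is clipped to the representable range [0, 2**bits - 1].
    # (bits=4 is an illustrative choice, not from the paper.)
    floor = 2.0 ** -(2 ** bits - 1)          # smallest representable probability
    k = np.round(-np.log2(np.maximum(p, floor)))
    return np.clip(k, 0, 2 ** bits - 1).astype(np.int32)

def log2_dequantize(k):
    # Dequantization is a power of two; in hardware this is a shift.
    return 2.0 ** (-k.astype(np.float64))

scores = np.array([2.0, 1.0, 0.5, -1.0])
p = softmax(scores)
k = log2_quantize(p)
p_hat = log2_dequantize(k)
```

Because rounding happens in the log2 domain, each reconstructed value `p_hat` is within a factor of sqrt(2) of the original probability, i.e. the relative error is bounded regardless of magnitude.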
ISSN: 0278-0070, 1937-4151
DOI: 10.1109/TCAD.2024.3488572