GTCO: Graph and Tensor Co-Design for Transformer-Based Image Recognition on Tensor Cores

Bibliographic Details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024-02, Vol. 43 (2), p. 586-599
Main Authors: Bai, Yang, Yao, Xufeng, Sun, Qi, Zhao, Wenqian, Chen, Shixin, Wang, Zixiao, Yu, Bei
Format: Article
Language: English
Description
Summary: Deep learning frameworks and compilers optimize the operators in a computation graph using fixed templates built with significant engineering effort, which may miss potential optimizations such as operator fusion. Automatically implementing and optimizing emerging new combinations of operators on a specific hardware accelerator is therefore important. In this article, we introduce GTCO, a tensor compilation system designed to accelerate the inference of transformer-based vision models on GPUs. GTCO tackles operator fusion in transformer-based models using a novel dynamic programming algorithm and proposes a search policy with new sketch generation rules for the fused batch matrix multiplication and softmax operators. Tensor programs are sampled from an effective search space, and a hardware abstraction with a hierarchical mapping from tensor computation to domain-specific accelerators (Tensor Cores) is formally defined. Finally, our framework maps and transforms tensor expressions into efficient CUDA kernels with hardware intrinsics on GPUs. Our experimental results demonstrate that GTCO improves end-to-end execution performance by up to 1.73× relative to the state-of-the-art deep learning library TensorRT on NVIDIA GPUs with Tensor Cores.
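The fusion the abstract highlights, combining batch matrix multiplication with softmax, is the core pattern of transformer attention. As a rough illustration of that pattern only (this is not GTCO's generated code, and it omits the Tensor Core intrinsics and sketch generation rules the paper describes), a minimal hand-written fused CUDA kernel might look as follows; the kernel name, the row-major layouts, and the assumption that the softmax dimension N fits in one thread block are ours.

// Hypothetical sketch of a fused batched-matmul + row-softmax kernel.
// Shapes (assumed): A is [batch][M][K], B is [batch][K][N], out is [batch][M][N],
// all row-major. Launch: grid = (M, batchCount), block = (N,), with
// N * sizeof(float) bytes of dynamic shared memory; requires N <= 1024.
#include <cuda_runtime.h>
#include <math.h>

__global__ void fused_bmm_softmax(const float* A, const float* B,
                                  float* out, int M, int K, int N) {
    int batch = blockIdx.y;   // one grid row per batch
    int row   = blockIdx.x;   // one block per output row
    int col   = threadIdx.x;  // one thread per output column

    extern __shared__ float buf[];  // holds one output row (N floats)

    // 1) Matmul: each thread computes one element of A[batch,row,:] @ B[batch,:,col].
    const float* a = A + ((size_t)batch * M + row) * K;
    const float* b = B + (size_t)batch * K * N;
    float acc = 0.f;
    for (int k = 0; k < K; ++k) acc += a[k] * b[k * N + col];
    buf[col] = acc;
    __syncthreads();

    // 2) Row max for numerical stability (serial scan by thread 0 for
    //    brevity; a production kernel would use a parallel reduction).
    __shared__ float row_max, row_sum;
    if (col == 0) {
        float m = buf[0];
        for (int j = 1; j < N; ++j) m = fmaxf(m, buf[j]);
        row_max = m;
    }
    __syncthreads();

    // 3) Exponentiate and sum.
    float e = expf(acc - row_max);
    buf[col] = e;
    __syncthreads();
    if (col == 0) {
        float s = 0.f;
        for (int j = 0; j < N; ++j) s += buf[j];
        row_sum = s;
    }
    __syncthreads();

    // 4) Normalize. Because both stages run in one kernel, the matmul
    //    result never round-trips through global memory, which is the
    //    point of fusing the two operators.
    out[((size_t)batch * M + row) * N + col] = e / row_sum;
}

A compiler such as GTCO would instead derive a kernel like this automatically from the tensor expressions, tile it onto Tensor Core fragments, and search over the resulting schedule space rather than fixing one hand-written mapping.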
ISSN: 0278-0070, 1937-4151
DOI: 10.1109/TCAD.2023.3317169