PF-GEMV: Utilization maximizing architecture in fast matrix-vector multiplication for GPT-2 inference
Published in: ETRI Journal, 2024, Vol. 46(5), pp. 817-828
Main authors: , ,
Format: Article
Language: Korean
Online access: Full text
Abstract: Owing to the widespread advancement of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix-vector multiplication in addition to conventional matrix-matrix multiplication. However, current AI processor architectures are optimized for general matrix-matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix-vector multiplications (GEMVs). In this study, we propose a port-folding GEMV (PF-GEMV) scheme that employs multiformat and low-precision techniques while reusing an outer-product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8 × 8 processor, resulting in a 7.5× increase in throughput over the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, a 7× speedup is achieved in single-batch inference.
ISSN: 1225-6463, 2233-7326
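
As a rough consistency check on the figures quoted in the abstract: if a naive GEMV on an 8 × 8 outer-product processor keeps only one 8-element column of MAC units busy per cycle (an assumption made here for illustration; the record itself does not describe the baseline mapping), baseline utilization is 12.5%, and the reported 93.7% utilization then implies roughly the stated 7.5× throughput gain. A minimal sketch of that arithmetic in Python:

    # Sanity check of the utilization/throughput figures in the abstract.
    # Assumption (illustrative, not stated in the record): a naive GEMV
    # activates one 8x1 column of an 8x8 outer-product MAC array per cycle,
    # since the vector operand supplies only a single column.
    PE_ROWS, PE_COLS = 8, 8
    total_macs = PE_ROWS * PE_COLS          # 64 MAC units available per cycle
    naive_active = PE_ROWS * 1              # one column busy in naive GEMV
    naive_util = naive_active / total_macs  # 8/64 = 12.5%
    pf_util = 0.937                         # PF-GEMV utilization at 8-bit (abstract)
    speedup = pf_util / naive_util          # 0.937 / 0.125 ≈ 7.5
    print(f"naive GEMV utilization:  {naive_util:.1%}")  # 12.5%
    print(f"PF-GEMV utilization:     {pf_util:.1%}")     # 93.7%
    print(f"implied throughput gain: {speedup:.1f}x")    # 7.5x

Under that assumption, 0.937 / 0.125 ≈ 7.5, which matches the 7.5× throughput gain the abstract reports for the 8-bit format.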