LAS: Locality-Aware Scheduling for GEMM-Accelerated Convolutions in GPUs



Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2023-05, Vol. 34 (5), pp. 1479-1494
Main Authors: Kim, Hyeonjin; Song, William J.
Format: Article
Language: English
Description
Abstract: This article presents a graphics processing unit (GPU) scheduling scheme that maximizes the exploitation of data locality in deep neural networks (DNNs). Convolution is one of the fundamental operations used in DNNs and accounts for more than 90% of the total execution time. To leverage massive thread-level parallelism (TLP) in a GPU, deeply nested convolution loops are lowered (or unrolled) into a large matrix multiplication, which trades memory capacity and bandwidth for TLP augmentation. The large workspace matrix is split into general matrix multiplication (GEMM) tiles and executed concurrently by many thread blocks. Notably, the lowering process fills the workspace with many duplicated data elements that originate from the same sources in the input feature map. However, conventional GPU scheduling is oblivious to the data duplication patterns in the workspace, and thread blocks are assigned to streaming multiprocessors (SMs) irrespective of the data similarity between GEMM tiles. Such scheduling misses a significant opportunity to exploit the data locality manifested in DNN convolutions. This article proposes a GPU scheduling technique called Locality-Aware Scheduling (LAS) that i) identifies which thread blocks share the largest amount of identical data based on the lowered patterns of a DNN convolution and ii) allocates the thread blocks showing the greatest data similarity to the same SM. In this way, the small caches in SMs can efficiently utilize the data locality of the DNN convolution. Experimental results show that LAS with tensor cores achieves a 20.1% performance improvement on average, with a 14.8% increase in L1 cache hit rates.
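
The two ideas in the abstract can be made concrete with a small sketch: lowering a convolution input into an im2col-style workspace (which duplicates input elements across workspace rows), and then pairing the GEMM row-tiles that read the most overlapping input elements, as a stand-in for assigning such thread blocks to the same SM. This is a minimal illustration assuming a single-channel input, unit stride, and a greedy pairing heuristic; the function names (im2col, tile_source_sets, pair_tiles_by_overlap) and the tiling granularity are illustrative and are not the authors' implementation.

import numpy as np

def im2col(x, k, stride=1):
    """Lower a single-channel H x W input into a workspace whose rows are
    flattened k x k patches; neighbouring rows reuse most of the same pixels."""
    H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    rows = []
    for i in range(out_h):
        for j in range(out_w):
            rows.append(x[i*stride:i*stride+k, j*stride:j*stride+k].ravel())
    return np.stack(rows)  # shape: (out_h*out_w, k*k)

def tile_source_sets(out_h, out_w, k, stride, tile_rows):
    """For each row-tile of the workspace, collect the set of input pixel
    coordinates it reads (its source footprint in the input feature map)."""
    coords = [(i, j) for i in range(out_h) for j in range(out_w)]
    sets = []
    for start in range(0, len(coords), tile_rows):
        s = set()
        for (i, j) in coords[start:start + tile_rows]:
            for di in range(k):
                for dj in range(k):
                    s.add((i*stride + di, j*stride + dj))
        sets.append(s)
    return sets

def pair_tiles_by_overlap(sets):
    """Greedy pairing: repeatedly co-schedule the two remaining tiles whose
    source footprints overlap the most (a stand-in for mapping them to one SM)."""
    remaining = set(range(len(sets)))
    pairs = []
    while len(remaining) > 1:
        a = min(remaining)
        remaining.remove(a)
        b = max(remaining, key=lambda t: len(sets[a] & sets[t]))
        remaining.remove(b)
        pairs.append((a, b, len(sets[a] & sets[b])))
    return pairs

if __name__ == "__main__":
    H = W = 8; k = 3; stride = 1; tile_rows = 6
    x = np.arange(H * W, dtype=np.float32).reshape(H, W)
    ws = im2col(x, k, stride)
    print("workspace elements:", ws.size, "vs. input elements:", x.size)
    out_h = out_w = (H - k) // stride + 1
    sets = tile_source_sets(out_h, out_w, k, stride, tile_rows)
    for a, b, ov in pair_tiles_by_overlap(sets):
        print(f"co-schedule tiles {a} and {b}: {ov} shared input elements")

In this toy example, the 3 x 3 lowering expands a 64-element input into a 324-element workspace, and adjacent row-tiles share 16 of their 24 source pixels; that redundancy between co-scheduled tiles is the kind of locality the paper aims to keep resident in an SM's L1 cache.
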
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2023.3247808