An analytical GPU performance model for 3D stencil computations from the angle of data traffic

Bibliographic details
Published in: The Journal of Supercomputing 2015-07, Vol. 71 (7), p. 2433-2453
Main authors: Su, Huayou; Cai, Xing; Wen, Mei; Zhang, Chunyuan
Format: Article
Language: English
Description
Abstract: The achievable GPU performance of many scientific computations is determined not by a GPU’s peak floating-point rate, but rather by how fast data are moved through the different stages of the memory hierarchy. We take low-order 3D stencil computations as a representative class to study the reachable GPU performance from the angle of data traffic. Specifically, we propose a simple analytical model that estimates the execution time by quantifying the data traffic volume at three stages: (1) between registers and on-streaming multiprocessor (SMX) storage, (2) between on-SMX storage and L2 cache, and (3) between L2 cache and the GPU’s device memory. Three associated granularities are used: a CUDA thread, a thread block, and a set of simultaneously active thread blocks. For four chosen 3D stencil computations, NVIDIA’s profiling tools are used to verify the accuracy of the quantified data traffic volumes by examining a large number of executions with different problem sizes and thread block configurations. Moreover, by introducing an imbalance coefficient, together with the known realistic memory bandwidths, we can predict the execution time from the quantified data traffic volumes. For the four 3D stencils, the average error of the time predictions is 6.9% for a baseline implementation approach, whereas for a blocking implementation approach the average prediction error is 9.5%.
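To illustrate the idea of predicting time from traffic volumes, the following is a minimal sketch, not the formula from the article. It assumes the run time is set by the slowest of the three stages (volume divided by realistic bandwidth), scaled by an imbalance coefficient of at least 1. The names `Stage`, `estimate_time`, and `min_device_traffic_bytes`, the bottleneck formula itself, and all numbers in the usage lines are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Stage:
    """One stage of the memory hierarchy (hypothetical helper type)."""
    name: str
    bytes_moved: float  # quantified data traffic volume, in bytes
    bandwidth: float    # realistic (measured) bandwidth, in bytes per second


def estimate_time(stages: Iterable[Stage], imbalance: float = 1.0) -> float:
    """Assumed bottleneck model: slowest stage's transfer time times an imbalance factor."""
    if imbalance < 1.0:
        raise ValueError("imbalance coefficient is assumed to be >= 1")
    return imbalance * max(s.bytes_moved / s.bandwidth for s in stages)


def min_device_traffic_bytes(nx: int, ny: int, nz: int, bytes_per_value: int = 8) -> int:
    """Idealized device-memory traffic for one stencil sweep.

    Assumes perfect cache reuse: every grid point is read once and written once.
    """
    return 2 * nx * ny * nz * bytes_per_value


# Usage with placeholder volumes and bandwidths (not measurements from the article).
stages = [
    Stage("registers <-> on-SMX storage", 8.0e9, 1.0e12),
    Stage("on-SMX storage <-> L2 cache", 3.0e9, 5.0e11),
    Stage("L2 cache <-> device memory", 2.0e9, 2.0e11),
]
predicted_seconds = estimate_time(stages, imbalance=1.2)
```

With these placeholder inputs the device-memory stage is the bottleneck, so the estimate scales with its volume and bandwidth alone; a real use would take the volumes from a per-thread, per-block, and per-active-block-set count like the one the abstract describes.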
ISSN: 0920-8542, 1573-0484
DOI: 10.1007/s11227-015-1392-1