In-Place Matrix Transposition on GPUs

Bibliographic details
Published in:	IEEE Transactions on Parallel and Distributed Systems, 2016-03, Vol. 27 (3), pp. 776-788
Main authors:	Gomez-Luna, Juan; Sung, I-Jui; Chang, Li-Wen; Gonzalez-Linares, Jose Maria; Guil, Nicolas; Hwu, Wen-Mei W.
Format:	Article
Language:	English
Subjects:	Algebra; Algorithms; Arrays; Central processing units; GPU; Graphics processing units; In-Place; Kernels; Layout; Libraries; Mathematical models; Multicore processing; Optimization; Parallel processing; Searching; Throughput; Transposition

Description
Abstract:	Matrix transposition is an important building block for many numeric algorithms, such as the FFT. As more and more algebra libraries offload to GPUs, high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures because of their limited on-board memory capacity and high throughput. However, directly applying CPU in-place transposition algorithms does not provide the parallelism and locality that GPUs require to achieve good performance. In this paper we present an in-place matrix transposition approach for GPUs that is performed using elementary tile-wise transpositions. We propose low-level optimizations for the elementary transpositions and find the best-performing configurations for them. Then, we compare all sequences of transpositions that achieve full transposition and detect which is the most favorable for each matrix. We present a heuristic to guide the selection of tile sizes and compare it to brute-force search. We diagnose the drawback of our approach and propose a solution using minimal padding. With fast padding and unpadding kernels, the overall throughput is significantly increased. Finally, we compare our method to another recent implementation.
ISSN:	1045-9219 1558-2183
DOI:	10.1109/TPDS.2015.2412549