Fine-grained parallelization of lattice QCD kernel routine on GPUs


Bibliographic details
Published in:	Journal of Parallel and Distributed Computing, 2008-10, Vol. 68 (10), pp. 1350-1359
Main authors:	Ibrahim, Khaled Z.; Bodin, François; Pène, Olivier
Format:	Article
Language:	English
Subjects:	Data-parallel computing; Domain specific systems; General-purpose computing on graphics hardware; Graphic processing unit (GPU); High-performance computing; Lattice QCD calculations; SIMD computer architecture

Description
Summary:	Simulation time for the classical problem of Lattice Quantum Chromodynamics (Lattice QCD) is dominated by one kernel routine responsible for computing the actions of a Dirac operator. This paper describes our experience parallelizing this kernel routine. We explore parallelization granularities for this kernel routine on Graphical Processing Units (GPUs). We show that fine-grained parallelism can outperform coarse-grained parallelization, provided that control-flow and communication effects are minimized. We propose two techniques for transforming control-flow-based code to control-free code. We also show how to reduce the communication effect by optimizing for commonly used sequences of calls to this routine. In our implementation on an NVIDIA 8800 GTX, we were able to achieve an 8.3x speedup over an SSE2-optimized version on a 2.8 GHz Intel Xeon CPU.
ISSN:	0743-7315, 1096-0848
DOI:	10.1016/j.jpdc.2008.06.009
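
Illustrative note: the summary mentions transforming control-flow-based code into control-free code but does not say which two techniques the paper uses. The sketch below shows two generic ways of removing a data-dependent branch, arithmetic selection and a precomputed lookup table, applied to a forward-neighbor lookup on a periodic lattice. The 1-D lattice and all function names are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative only: the 1-D periodic lattice and all names here are
# assumptions, not the paper's code or necessarily its techniques.

def neighbor_branchy(x, n):
    """Forward neighbor on a periodic 1-D lattice of n sites, using a branch."""
    if x == n - 1:
        return 0
    return x + 1

def neighbor_select(x, n):
    """Same result via arithmetic selection instead of a branch."""
    wrap = int(x == n - 1)        # 1 at the boundary site, 0 elsewhere
    return (1 - wrap) * (x + 1)   # boundary -> 0, interior -> x + 1

def build_neighbor_table(n):
    """Precompute all forward neighbors once; later code only indexes."""
    return [(x + 1) % n for x in range(n)]

n = 8
table = build_neighbor_table(n)
for x in range(n):
    # All three formulations agree on every site.
    assert neighbor_branchy(x, n) == neighbor_select(x, n) == table[x]
```

On SIMD-style hardware, both branch-free variants let every thread execute the same instruction sequence regardless of whether its site lies on the boundary. The table variant trades the comparison for a memory access.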