Efficient Intranode Communication in GPU-Accelerated Systems

Bibliographic Details
Main Authors:	Feng Ji; Aji, A. M.; Dinan, J.; Buntinas, D.; Balaji, P.; Wu-chun Feng; Xiaosong Ma
Format:	Conference Proceeding
Language:	English
Subjects:	Bandwidth; Computer architecture; CUDA; GPU; Graphics processing unit; Intranode communication; Manuals; MPI; MPICH2; Nemesis; Performance evaluation; Programming; Receivers

Description
Abstract:	Current implementations of MPI are unaware of accelerator memory (i.e., GPU device memory) and require programmers to explicitly move data between memory spaces. This approach is inefficient, especially for intranode communication, where it can result in several extra copy operations. In this work, we integrate GPU awareness into a popular MPI runtime system and develop techniques to significantly reduce the cost of intranode communication involving one or more GPUs. Experimental results show up to a 2x increase in bandwidth, resulting in an average 4.3% improvement in the total execution time of a halo exchange benchmark.
DOI:	10.1109/IPDPSW.2012.227