Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast

Bibliographic details
Published in:	IEEE Transactions on Parallel and Distributed Systems, 2019-03, Vol. 30 (3), p. 575-588
Main authors:	Chu, Ching-Hsiang; Lu, Xiaoyi; Awan, Ammar A.; Subramoni, Hari; Elton, Bracy; Panda, Dhabaleswar K.
Format:	Article
Language:	English
Subjects:	Analytical models; Bandwidth; Benchmarks; Broadcast; Broadcasting; Clustering algorithms; deep learning; GPU; GPUDirect RDMA; Graphics processing units; Hardware; hardware multicast; heterogeneous broadcast; Machine learning; Message passing; Multicast; Scalability; State of the art; streaming; Workload

Description
Abstract:	Broadcast is a widely used operation in many streaming and deep learning applications to disseminate large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize hardware features for Graphics Processing Unit (GPU)-based applications. In this paper, a model-oriented analysis is presented to identify performance bottlenecks of existing broadcast schemes on GPU clusters. Next, streaming-based broadcast schemes are proposed to exploit InfiniBand hardware multicast (IB-MCAST) and NVIDIA GPUDirect technology for efficient message transmission. The proposed designs are evaluated in the context of using Message Passing Interface (MPI) based benchmarks and applications. The experimental results indicate improved scalability and up to 82 percent reduction of latency compared to the state-of-the-art solutions in the benchmark-level evaluation. Furthermore, compared to the state-of-the-art, the proposed design yields stable higher throughput for a synthetic streaming workload, and 1.3x faster training time for a deep learning framework.
ISSN:	1045-9219 1558-2183
DOI:	10.1109/TPDS.2018.2867222