Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects

Bibliographic details
Published in:	IEEE Micro, 2020-01, Vol. 40 (1), pp. 35-43
Main authors:	Awan, Ammar Ahmad; Jain, Arpan; Chu, Ching-Hsiang; Subramoni, Hari; Panda, Dhabaleswar K.
Format:	Article
Language:	English
Subjects:	Communication Libraries; Deep learning; Depth profiling; Distributed computing; Graphics processing units; Heterogeneous networks; Horovod; InfiniBand; Interconnections; Mathematical analysis; Middleware; MVAPICH2 MPI; NVLink; Omni-Path; PCIe; Performance analysis; Profiling; TensorFlow; Tensors; Training; Training data; Workloads

Description
Abstract:	Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like Volta V100, and various interconnects to analyze the following metrics: 1) message-size with Horovod's tensor-fusion; 2) message-size without tensor-fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, illustrate significantly smoother allreduce latency profiles, and report cases where we observed improvement for end-to-end training.
ISSN:	0272-1732, 1937-4143
DOI:	10.1109/MM.2019.2949986
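
Illustrative note: the abstract mentions a message-padding scheme for non-power-of-two message sizes but does not describe its details. The following is only a minimal sketch of the general idea, assuming the buffer is padded with zeros up to the next power of two before a sum-allreduce and trimmed afterwards. The function names (`next_power_of_two`, `pad_for_allreduce`, `simulated_sum_allreduce`) are hypothetical, and the allreduce is simulated in plain Python rather than calling Horovod, MPI, or NCCL; it should not be read as the authors' implementation.

```python
from typing import List, Tuple


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (n must be >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def pad_for_allreduce(buf: List[float]) -> Tuple[List[float], int]:
    """Zero-pad buf to a power-of-two length; also return the original length.

    Zeros are the identity for summation, so the padding does not change
    the reduced values in the original positions (an assumption that holds
    for sum-allreduce, not for every reduction operator).
    """
    original_len = len(buf)
    if original_len == 0:
        return [], 0
    padded_len = next_power_of_two(original_len)
    return buf + [0.0] * (padded_len - original_len), original_len


def simulated_sum_allreduce(buffers: List[List[float]]) -> List[float]:
    """Stand-in for a real allreduce: element-wise sum across all ranks."""
    return [sum(values) for values in zip(*buffers)]


# Three simulated ranks, each holding a 5-element (non-power-of-two) gradient.
rank_buffers = [[float(rank + i) for i in range(5)] for rank in range(3)]

padded = [pad_for_allreduce(b) for b in rank_buffers]
reduced_padded = simulated_sum_allreduce([p for p, _ in padded])

# Trim the padding so the caller sees a result of the original size.
original_len = padded[0][1]
reduced = reduced_padded[:original_len]
```

In this sketch each rank's 5-element buffer travels as 8 elements, trading a small amount of extra data for a message size that the abstract reports as behaving more predictably on the studied platforms.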