A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training
Published in: IEEE Transactions on Parallel and Distributed Systems, 2024-08, Vol. 35(8), pp. 1415-1428
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. As parameter counts grow, hybrid parallel training becomes imperative for scaling. The primary bottleneck in scaling remains the communication overhead. Communication scheduling, which emphasizes overlapping communication with computation, has demonstrated its benefits for scaling. However, most existing works focus on data parallelism and overlook the nuances of hybrid parallel training. In this paper, we propose TriRace, an efficient communication scheduling framework for accelerating communication in hybrid parallel training that combines asynchronous pipeline parallelism and data parallelism. To achieve effective computation-communication overlap, TriRace introduces 3D communication scheduling, which leverages the data dependencies between communication and computation to efficiently schedule AllReduce, sparse, and peer-to-peer communication in hybrid parallel training. To avoid possible communication contention, TriRace also incorporates a topology-aware runtime that optimizes the execution of communication operations by considering ongoing communication operations and real-time network status. We have implemented a prototype of TriRace based on PyTorch and Pipedream-2BW and conducted comprehensive evaluations against three representative baselines. Experimental results show that TriRace achieves a 1.07-1.45× speedup over the state-of-the-art pipeline parallelism baseline Pipedream-2BW, and a 1.24-1.81× speedup over Megatron.
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2024.3406420
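To make the computation-communication overlap described in the abstract concrete, below is a minimal sketch using PyTorch's asynchronous collectives. This is not the TriRace implementation: the gradient-bucket layout and the helper `compute_next_bucket` are illustrative assumptions only.

```python
# Minimal sketch of computation-communication overlap with asynchronous
# AllReduce in PyTorch distributed. NOT the TriRace scheduler: the bucket
# layout and the `compute_next_bucket` helper are assumed for illustration.
import torch.distributed as dist

def overlapped_allreduce(grad_buckets, compute_next_bucket):
    """Launch AllReduce for each ready gradient bucket without blocking,
    so the backward pass can keep computing while transfers are in flight."""
    pending = []
    for bucket in grad_buckets:
        # async_op=True returns a work handle immediately instead of blocking,
        # letting subsequent computation overlap with the network transfer.
        handle = dist.all_reduce(bucket, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((handle, bucket))
        compute_next_bucket()  # overlapped computation, e.g. the next backward chunk
    # Synchronize only once, right before the optimizer step.
    world_size = dist.get_world_size()
    for handle, bucket in pending:
        handle.wait()
        bucket.div_(world_size)  # average gradients across data-parallel ranks
```

The underlying principle, issuing a non-blocking communication operation as soon as its data dependency is satisfied and deferring synchronization to the latest safe point, is what a hybrid-parallel scheduler such as TriRace extends to peer-to-peer and sparse communication across pipeline stages.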