A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training
Published in: IEEE Transactions on Parallel and Distributed Systems, 2024-08, Vol. 35(8), pp. 1415-1428
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. As parameter counts grow, hybrid parallel training becomes imperative for scaling. The primary bottleneck in scaling remains the communication overhead. Communication scheduling, which emphasizes overlapping communication with computation, has demonstrated its benefits for scaling. However, most existing works focus on data parallelism and overlook the nuances of hybrid parallel training. In this paper, we propose TriRace, an efficient communication scheduling framework for accelerating communication in hybrid parallel training that combines asynchronous pipeline parallelism and data parallelism. To achieve effective computation-communication overlap, TriRace introduces 3D communication scheduling, which leverages the data dependencies between communication and computation to efficiently schedule AllReduce, sparse, and peer-to-peer communication in hybrid parallel training. To avoid possible communication contention, TriRace also incorporates a topology-aware runtime that optimizes the execution of communication operations by considering ongoing communication operations and real-time network status. We have implemented a prototype of TriRace based on PyTorch and Pipedream-2BW and conducted comprehensive evaluations against three representative baselines. Experimental results show that TriRace achieves a 1.07-1.45× speedup over the state-of-the-art pipeline parallelism baseline Pipedream-2BW, and a 1.24-1.81× speedup over Megatron.
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2024.3406420
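To make the computation-communication overlap described in the abstract concrete, below is a minimal sketch using PyTorch's asynchronous collectives. This is not the TriRace implementation: the gradient-bucket layout and the helper `compute_next_bucket` are illustrative assumptions only.

```python
# Minimal sketch of computation-communication overlap with asynchronous
# AllReduce in PyTorch distributed. NOT the TriRace scheduler: the bucket
# layout and the `compute_next_bucket` helper are assumed for illustration.
import torch.distributed as dist

def overlapped_allreduce(grad_buckets, compute_next_bucket):
    """Launch AllReduce for each ready gradient bucket without blocking,
    so the backward pass can keep computing while transfers are in flight."""
    pending = []
    for bucket in grad_buckets:
        # async_op=True returns a work handle immediately instead of blocking,
        # letting subsequent computation overlap with the network transfer.
        handle = dist.all_reduce(bucket, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((handle, bucket))
        compute_next_bucket()  # overlapped computation, e.g. the next backward chunk
    # Synchronize only once, right before the optimizer step.
    world_size = dist.get_world_size()
    for handle, bucket in pending:
        handle.wait()
        bucket.div_(world_size)  # average gradients across data-parallel ranks
```

The underlying principle, issuing a non-blocking communication operation as soon as its data dependency is satisfied and deferring synchronization to the latest safe point, is what a hybrid-parallel scheduler such as TriRace extends to peer-to-peer and sparse communication across pipeline stages.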