Communication Optimization Algorithms for Distributed Deep Learning Systems: A Survey



Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2023-12, Vol. 34 (12), p. 3294-3308
Main Authors: Yu, Enda; Dong, Dezun; Liao, Xiangke
Format: Article
Language: English
Description
Abstract: Deep learning's widespread adoption across many fields has made distributed training over multiple computing nodes essential. However, frequent communication between nodes can significantly slow down training, creating a bottleneck in distributed training. To address this issue, researchers are focusing on communication optimization algorithms for distributed deep learning systems. In this paper, we propose a classification standard that systematically organizes communication optimization algorithms based on mathematical modeling, something existing surveys in the field have not achieved. We categorize existing works into four categories according to their communication optimization strategy: communication masking, communication compression, communication frequency reduction, and hybrid optimization. Finally, we discuss potential future challenges and research directions in the field of communication optimization algorithms for distributed deep learning systems.
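
To make the "communication compression" category concrete, the following is a minimal Python/PyTorch sketch (illustrative only, not taken from the survey) of top-k gradient sparsification: each worker transmits only the k largest-magnitude gradient entries, keeping the residual locally for error feedback. The function names and the 1% compression ratio are assumptions for illustration.

    import torch

    def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
        # Keep only the top `ratio` fraction of entries by magnitude.
        # Returns (values, indices, shape): the payload actually communicated.
        flat = grad.flatten()
        k = max(1, int(flat.numel() * ratio))
        _, indices = torch.topk(flat.abs(), k)
        return flat[indices], indices, grad.shape

    def topk_decompress(values, indices, shape):
        # Rebuild a dense gradient tensor from the sparse payload.
        flat = torch.zeros(shape.numel(), dtype=values.dtype)
        flat[indices] = values
        return flat.view(shape)

    # Toy usage: compress a fake gradient, keeping 1% of its entries.
    grad = torch.randn(1024, 1024)
    values, indices, shape = topk_compress(grad, ratio=0.01)
    restored = topk_decompress(values, indices, shape)
    # In an error-feedback scheme, the residual (grad - restored) would be
    # accumulated locally and added to the gradient at the next step.

The other categories differ in what they change: communication masking overlaps communication with computation, and communication frequency reduction (e.g., local-update methods) synchronizes less often rather than sending less per synchronization.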
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2023.3323282