Overlapping Communication With Computation in Parameter Server for Scalable DL Training

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2021-09, Vol. 32 (9), pp. 2144-2159
Authors: Wang, Shaoqi; Pi, Aidi; Zhou, Xiaobo; Wang, Jun; Xu, Cheng-Zhong
Format: Article
Language: English
Description
Abstract: The scalability of distributed deep learning (DL) training with the parameter server (PS) architecture is often communication constrained in large clusters. Recent efforts use a layer-by-layer strategy to overlap gradient communication with backward computation and thereby reduce the impact of the communication constraint on scalability. However, these approaches can introduce significant overhead in gradient communication. Moreover, they cannot be effectively applied to overlapping parameter communication with forward computation. In this article, we propose and develop iPart, a novel approach that partitions communication and computation into various partition sizes to overlap gradient communication with backward computation and parameter communication with forward computation. iPart formulates the partitioning decision as an optimization problem and solves it with a greedy algorithm to derive the communication and computation partitions. We implement iPart in the open-source DL framework BigDL and evaluate it with various DL workloads. Experimental results show that iPart improves the scalability of a 72-node cluster by up to 94 percent over the default PS and 52 percent over the layer-by-layer strategy.
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2021.3062721
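
As a rough illustration of the overlap idea described in the abstract, the sketch below models gradient communication as fixed-size chunks that are pipelined with the remaining backward computation and picks the chunk size with the lowest estimated iteration time. This is a hypothetical Python sketch under simplified assumptions: the cost model, the function names (estimated_iteration_time, choose_chunk_size), and the timing numbers are illustrative, and the plain search over candidate sizes stands in for the paper's greedy optimization; it is not the iPart algorithm or the BigDL implementation.

# Hypothetical sketch: not the iPart algorithm or BigDL code. It estimates how
# long one backward pass plus gradient push takes when gradients are sent in
# fixed-size chunks that overlap with the remaining backward computation.

def estimated_iteration_time(bwd_times_ms, grad_sizes_mb, bandwidth_mb_per_ms,
                             chunk_size_mb, per_message_overhead_ms):
    comm_finish = 0.0    # time at which the last pushed chunk completes
    compute_clock = 0.0  # running backward-computation time
    for bwd, size in zip(bwd_times_ms, grad_sizes_mb):
        compute_clock += bwd  # this layer's gradients exist only after its backward pass
        sent = 0.0
        while sent < size:
            chunk = min(chunk_size_mb, size - sent)
            # a chunk starts once it exists and the network link is free
            start = max(compute_clock, comm_finish)
            comm_finish = start + per_message_overhead_ms + chunk / bandwidth_mb_per_ms
            sent += chunk
    return max(compute_clock, comm_finish)

def choose_chunk_size(bwd_times_ms, grad_sizes_mb, bandwidth_mb_per_ms,
                      per_message_overhead_ms, candidates_mb):
    # Plain search over candidate chunk sizes; a stand-in for the paper's
    # greedy partitioning algorithm.
    return min(candidates_mb,
               key=lambda c: estimated_iteration_time(
                   bwd_times_ms, grad_sizes_mb, bandwidth_mb_per_ms,
                   c, per_message_overhead_ms))

if __name__ == "__main__":
    bwd_times_ms = [2.0, 3.0, 5.0]     # illustrative per-layer backward times
    grad_sizes_mb = [4.0, 16.0, 64.0]  # illustrative per-layer gradient sizes
    best = choose_chunk_size(bwd_times_ms, grad_sizes_mb,
                             bandwidth_mb_per_ms=1.0,
                             per_message_overhead_ms=0.5,
                             candidates_mb=[1, 2, 4, 8, 16, 32, 64])
    print("chunk size with lowest estimated iteration time (MB):", best)

Small chunks hide communication behind computation better but pay more per-message overhead; large chunks amortize overhead but delay the first push. Navigating that trade-off is what a partition-size search of this kind captures.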