TSEngine: Enable Efficient Communication Overlay in Distributed Machine Learning in WANs

Bibliographic Details
Published in: IEEE Transactions on Network and Service Management, Dec. 2021, Vol. 18, No. 4, pp. 4846-4859
Authors: Zhou, Huaman; Cai, Weibo; Li, Zonghang; Yu, Hongfang; Liu, Ling; Luo, Long; Sun, Gang
Format: Article
Language: English
Abstract: In recent years, distributed machine learning in WANs (DML-WANs), i.e., collaboratively training a high-quality ML model across geo-distributed micro-clouds or edge devices, has attracted attention and been widely applied. Compared with cloud-centric training, DML-WANs avoids the high cost of transferring large amounts of raw data to a central cloud as well as the associated privacy concerns. However, performing DML-WANs still faces challenges. Model synchronization, an essential step of DML-WANs, involves heavy model communication across limited-bandwidth WANs, which generates high communication overhead. Moreover, the widely used parameter server system performs model synchronization in a centralized manner, resulting in a serious communication in-cast problem. Such in-cast further raises the communication overhead, leading to low efficiency of DML-WANs. To alleviate the communication in-cast, existing studies attempt to build tree-based communication overlays over the parameter server and workers. However, we identify that these approaches cannot adapt to the dynamic and heterogeneous networks of DML-WANs, resulting in insufficient improvements. This paper proposes TSEngine, an adaptive communication scheduler for an efficient communication overlay of the parameter server system in DML-WANs. Its core idea is to dynamically schedule the communication logic over the parameter server and workers based on active network perception. Specifically, we propose novel communication scheduling protocols for model distribution and model aggregation, respectively. We have implemented TSEngine in a mainstream parameter server system and verified its effectiveness on DML-WANs testbeds.
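
For intuition, the sketch below illustrates the kind of bandwidth-aware overlay scheduling the abstract describes: at each synchronization round, the model-distribution path over the parameter server and workers is rebuilt from freshly measured link bandwidths. This is a minimal illustrative sketch, not TSEngine's actual protocol; the node names, the bandwidth table, and the build_distribution_overlay helper are assumptions made for illustration.

```python
# Minimal sketch (assumed, not TSEngine's protocol): rebuild the model
# distribution overlay each round from measured pairwise bandwidths, so
# fast links carry the relaying work instead of a fixed, static tree.

from typing import Dict, List, Tuple

Bandwidth = Dict[Tuple[str, str], float]  # measured Mbps between node pairs


def build_distribution_overlay(server: str, workers: List[str],
                               bw: Bandwidth) -> List[Tuple[str, str]]:
    """Greedily attach each worker to the already-served node it can pull
    the model from fastest, yielding a bandwidth-aware distribution tree."""
    served = [server]
    edges: List[Tuple[str, str]] = []
    remaining = set(workers)
    while remaining:
        # Pick the (sender, receiver) pair with the highest measured
        # bandwidth among all served -> unserved links.
        sender, receiver = max(
            ((s, r) for s in served for r in remaining),
            key=lambda pair: bw.get(pair, 0.0),
        )
        edges.append((sender, receiver))
        served.append(receiver)
        remaining.remove(receiver)
    return edges


if __name__ == "__main__":
    # Hypothetical measurements for one parameter server ("ps") and three
    # workers; the overlay is recomputed whenever these measurements change.
    bw = {("ps", "w1"): 80.0, ("ps", "w2"): 10.0, ("ps", "w3"): 15.0,
          ("w1", "w2"): 60.0, ("w1", "w3"): 20.0, ("w2", "w3"): 50.0}
    print(build_distribution_overlay("ps", ["w1", "w2", "w3"], bw))
    # -> [('ps', 'w1'), ('w1', 'w2'), ('w2', 'w3')]
```

Because the overlay is recomputed from current measurements rather than fixed in advance, slow WAN links are bypassed in favor of faster worker-to-worker relays, which conveys the adaptivity to dynamic, heterogeneous networks that the paper targets.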
ISSN: 1932-4537
DOI: 10.1109/TNSM.2021.3106315