Distributed deep learning system for cancerous region detection on Sunway TaihuLight

To explore the potential of distributed training on deep neural networks, we implement several distributed algorithms with the basis of swFlow on the world-leading supercomputer, Sunway TaihuLight. Based on two naive designs of parameter server and ring all-reduce, we present the limitation of the c...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:CCF transactions on high performance computing (Online) 2020-12, Vol.2 (4), p.348-361
Hauptverfasser: Lv, GuoFeng, Li, MingFan, An, Hong, Lin, Han, Chen, Junshi, Han, Wenting, Xiao, Qian, Wang, Fei, Lin, Rongfen
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:To explore the potential of distributed training on deep neural networks, we implement several distributed algorithms with the basis of swFlow on the world-leading supercomputer, Sunway TaihuLight. Based on two naive designs of parameter server and ring all-reduce, we present the limitation of the communication model and discuss the optimizations for adapting the five-level interconnect architecture of Sunway system. To reduce the communication bottleneck on large scale system, multi-severs and hierarchical ring all-reduce models are introduced. With a benchmark from deep learning-based cancerous region detection algorithm, the average parallel efficiency obtains over 80% for at most 1024 processors. It reveals the great opportunity for joint combination of deep learning and HPC system.
ISSN:2524-4922
2524-4930
DOI:10.1007/s42514-020-00046-5