Accelerating Distributed GNN Training by Codes
Published in: IEEE Transactions on Parallel and Distributed Systems, 2023-09, Vol. 34 (9), pp. 2598-2614
Format: Article
Language: English
Abstract: Emerging graph neural networks (GNNs) have recently attracted much attention and are used extensively in many real-world applications thanks to their powerful ability to represent unstructured data. Real-world graph datasets are very large, containing up to billions of nodes and tens of billions of edges, so training a GNN on such huge datasets usually requires a distributed system. As a result, data communication between machines becomes the bottleneck of GNN computation. Our profiling results show that fetching attributes from remote machines during the sampling phase occupies more than 75% of the training time. To address this issue, this article proposes the Coded Neighbor Sampling (CNS) framework, which introduces a coding technique to reduce the communication overhead of GNN training. In the CNS framework, the coding technique is coupled with the GNN sampling method to exploit the data excess among different machines caused by the unstructured nature of graph data. An analytical performance model is built for the CNS framework; its results are corroborated by simulation and validate the benefit of CNS over both the conventional GNN training method and the conventional coding technique. Performance metrics of the CNS framework, including communication overhead, runtime, and throughput, are evaluated on a distributed GNN training simulation system implemented on the MPI4py platform. The results show that, on average, CNS reduces communication overhead by 40.6%, 35.5%, and 16.5%, reduces runtime by 12.1%, 17.0%, and 10.0%, and improves throughput by 16.2%, 24.4%, and 11.2% when training GNN models on Cora, PubMed, and Large Taobao, respectively.
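The abstract describes the key mechanism only at a high level: because sampled neighborhoods on different machines overlap, the attributes one worker requests are often already cached by another worker, so a single coded transmission can serve several workers at once. The snippet below is a minimal sketch of that coded-multicast principle under stated assumptions, not the paper's CNS algorithm; the node names, the two-worker pairing, and the XOR packing of integer attributes are all illustrative.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): two workers each need one
# node's attribute vector that the other worker already caches locally.
# One XOR-coded broadcast can then serve both requests instead of two
# separate unicast transfers.

# Hypothetical node attributes held by the sending machine (uint8 so that
# bitwise XOR is well defined; real-valued attributes would need packing).
attr = {
    "node_a": np.array([3, 7, 1, 9], dtype=np.uint8),
    "node_b": np.array([4, 2, 8, 5], dtype=np.uint8),
}

# Side information created by overlapping samples: worker 0 already caches
# node_b but needs node_a; worker 1 caches node_a but needs node_b.
worker_cache = {0: attr["node_b"], 1: attr["node_a"]}
worker_need = {0: "node_a", 1: "node_b"}

# Coded transmission: a single broadcast of node_a XOR node_b.
coded_packet = attr["node_a"] ^ attr["node_b"]

# Each worker decodes its missing attribute with its locally cached copy.
for w, needed in worker_need.items():
    decoded = coded_packet ^ worker_cache[w]
    assert np.array_equal(decoded, attr[needed])
    print(f"worker {w} recovered {needed}: {decoded}")
```

In this toy case one coded broadcast replaces two unicast transfers; communication savings of this kind, applied at scale during neighbor sampling, are what the overhead reductions reported in the abstract reflect.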
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2023.3295184