From Sancus to Sancus$$^q$$: staleness and quantization-aware full-graph decentralized training in graph neural networks

Bibliographic Details
Published in: The VLDB Journal 2025-03, Vol. 34 (2), Article 22
Authors: Peng, Jingshu; Liu, Qiyu; Chen, Zhao; Shao, Yingxia; Shen, Yanyan; Chen, Lei; Cao, Jiannong
Format: Article
Language: English
Online access: Full text
Description
Abstract: Graph neural networks (GNNs) have emerged due to their success at modeling graph data, yet it is challenging for GNNs to scale efficiently to large graphs; distributed GNNs address this. To avoid the communication caused by expensive data movement between workers, we propose Sancus and its advanced version Sancus$$^q$$, the staleness and quantization-aware communication-avoiding decentralized GNN system. By introducing a set of novel bounded embedding-staleness metrics and adaptively skipping broadcasts, Sancus abstracts decentralized GNN processing as sequential matrix multiplication and reuses historical embeddings via a cache. To further reduce communication volume, Sancus$$^q$$ performs quantization-aware communication on embeddings, shrinking the size of broadcast messages. Theoretically, we show bounded approximation errors of embeddings and gradients with a convergence guarantee matching the fastest known rate. Empirically, we evaluate Sancus and Sancus$$^q$$ with common GNN models under different system setups on large-scale benchmark datasets. Compared to state-of-the-art systems, Sancus$$^q$$ avoids up to $$86\%$$ of communication with $$3.0\times$$ faster throughput on average and no accuracy loss.
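To make the mechanism in the abstract concrete, the following is a minimal, hypothetical sketch in Python/NumPy of the two ideas it describes: skipping a broadcast when the cached (stale) embeddings are still close to the fresh ones, and quantizing the embedding matrix when a broadcast is actually needed. The names (should_broadcast, quantize, layer_step, eps, broadcast_fn) and the relative-drift staleness test are illustrative assumptions, not the paper's actual API or its bounded-staleness metrics.

```python
import numpy as np

def should_broadcast(local_emb, cached_emb, eps):
    # Hypothetical staleness test (not the paper's metric): broadcast only
    # when the local embeddings have drifted beyond a relative threshold
    # eps from the copy that peers last received.
    drift = np.linalg.norm(local_emb - cached_emb)
    return drift > eps * np.linalg.norm(cached_emb)

def quantize(emb, num_bits=8):
    # Uniform min-max quantization to num_bits-wide integer codes, an
    # illustrative stand-in for quantization-aware broadcasts.
    lo, hi = float(emb.min()), float(emb.max())
    scale = (hi - lo) / (2 ** num_bits - 1)
    if scale == 0.0:
        scale = 1.0  # constant matrix: avoid division by zero
    q = np.round((emb - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

def layer_step(A, H_local, cached_H, eps, broadcast_fn):
    # One propagation step on a worker: reuse the cached (stale) embeddings
    # when drift is small, otherwise broadcast a quantized refresh. The
    # aggregation itself is a plain matrix multiplication with the cache,
    # mirroring the "sequential matrix multiplication" abstraction.
    if should_broadcast(H_local, cached_H, eps):
        q, lo, scale = quantize(H_local)
        broadcast_fn(q, lo, scale)           # compressed message to peers
        cached_H = dequantize(q, lo, scale)  # all workers cache the same copy
    return A @ cached_H, cached_H

# Toy single-worker usage with a no-op broadcast:
rng = np.random.default_rng(0)
A = rng.random((4, 4)).astype(np.float32)    # local adjacency block
H = rng.random((4, 8)).astype(np.float32)    # node embeddings
H_out, cache = layer_step(A, H + 0.5, H.copy(), eps=0.1,
                          broadcast_fn=lambda *msg: None)
```

Under these assumptions, an 8-bit code carries roughly a quarter of the bytes of the float32 embeddings it replaces, which, combined with skipped broadcasts, is the kind of communication-volume reduction the abstract reports.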
ISSN: 1066-8888 (print), 0949-877X (electronic)
DOI: 10.1007/s00778-024-00897-2