Jdebug: A Fast, Non-Intrusive and Scalable Fault Locating Tool for Ten-Million-Scale Parallel Applications

Bibliographic details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2022-12, Vol. 33 (12), p. 3491-3504
Main authors: Peng, Dajia; Feng, Yunlong; Liu, Yong; Liu, Xin; Xue, Wei; Chen, Dexun; Song, Jiawei; Chen, Zuoning
Format: Article
Language: English
Description
Summary: This article presents Jdebug, a fast, non-intrusive and scalable fault locating tool for extreme-scale parallel applications. Large-scale debugging has drawn more attention as supercomputers and the applications running on them have grown in scale. To eliminate the program intrusion caused by traditional instrumentation or interception when acquiring debugging information, we introduce out-of-band management into large-scale debugging. We propose a rapid information gathering scheme that separates user traffic from debugging traffic, which solves the scalability problem and eliminates program interference while data is merged. Observing program counters (PCs) and performance characteristics of suspended applications reveals abnormalities and helps locate abnormal threads caused by software errors or hardware failures. Evaluation shows that Jdebug collects the PCs of over 20 million cores on the new Sunway supercomputer within 1.97 seconds, and can locate the abnormal threads in 1.4 seconds with an accuracy of 92.5%. In tests running three fundamental benchmarks (HPL, HPCG, Graph500) and seventeen real-world applications, Jdebug quickly and accurately locates abnormal threads, helping to find scalability errors and hardware failures including memory access failures, communication failures, and execution component failures, which validates its effectiveness.
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2022.3157690
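
The summary describes locating abnormal threads by observing the program counters of a suspended application. The sketch below illustrates that general idea only and is not Jdebug's algorithm, which the record does not describe in detail. It assumes a snapshot mapping each thread id to one sampled PC value and flags threads whose PC is shared by only a small fraction of all threads. The function name `find_outlier_threads`, the `min_share` threshold, and the addresses in the sample data are all hypothetical.

```python
from collections import Counter


def find_outlier_threads(pcs, min_share=0.01):
    """Return sorted ids of threads whose sampled PC is rare.

    pcs maps a thread id to its sampled program counter value.
    A thread is flagged when fewer than min_share of all threads
    share its PC (an assumed heuristic, not the paper's method).
    """
    if not pcs:
        return []
    counts = Counter(pcs.values())  # number of threads stopped at each PC
    total = len(pcs)
    return sorted(tid for tid, pc in pcs.items()
                  if counts[pc] / total < min_share)


# Hypothetical snapshot: 999 threads stopped at one address, one elsewhere.
snapshot = {tid: 0x4005D0 for tid in range(999)}
snapshot[999] = 0x400A18
suspects = find_outlier_threads(snapshot)
```

With this made-up snapshot, only thread 999 falls below the 1% share threshold. A real tool would also need the performance characteristics mentioned in the summary, since a rare PC alone does not prove a fault.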