Iaso: an autonomous fault-tolerant management system for supercomputers

Bibliographic details
Published in: Frontiers of Computer Science 2014-06, Vol.8 (3), p.378-390
Main authors: LU, Kai; WANG, Xiaoping; LI, Gen; WANG, Ruibo; CHI, Wanqing; LIU, Yongpeng; TANG, Hongwei; FENG, Hua; GAO, Yinghui
Format: Article
Language: English
Description
Abstract: As system scale increases, the inherent reliability of supercomputers steadily declines. The cost of fault handling and task recovery grows so rapidly that the reliability issue will soon harm the usability of supercomputers. This issue is referred to as the "reliability wall", which is regarded as a critical problem for current and future supercomputers. To address this problem, we propose an autonomous fault-tolerant system, named Iaso, in the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers. Under autonomous management, the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers is reduced from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
ISSN: 2095-2228, 2095-2236
DOI: 10.1007/s11704-014-3503-1