'Mutual Watch-dog Networking': Distributed Awareness of Faults and Critical Events in Petascale/Exascale systems

Bibliographic details
Main authors: Ammendola, Roberto; Biagioni, Andrea; Frezza, Ottorino; Lo Cicero, Francesca; Lonardo, Alessandro; Paolucci, Pier Stanislao; Rossetti, Davide; Simula, Francesco; Tosoratto, Laura; Vicini, Piero
Format: Article
Language: English
Description
Summary: Many-tile systems require techniques that increase component resilience and control the FIT (Failures In Time) rate. When scaling to peta- and exascale systems, the FIT rate may become unacceptable because of the sheer number of components, requiring more systemic countermeasures. Thus, the ability to be fault aware, i.e. to detect and collect information about faults and critical events, is a necessary feature that large-scale distributed architectures must provide in order to apply systemic fault tolerance techniques. In this context, the LO|FA|MO approach is a way to obtain systemic fault awareness by implementing a mutual watchdog mechanism and guaranteeing fault detection with no single point of failure. This document, in the form of a technical report, contains specification and implementation details of this approach.
DOI: 10.48550/arxiv.1307.0433