'Mutual Watch-dog Networking': Distributed Awareness of Faults and Critical Events in Petascale/Exascale systems
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Abstract: | Many-tile systems require techniques that increase component
resilience and control the FIT (Failures In Time) rate. When scaling to peta-
and exascale systems, the FIT rate may become unacceptable because of the sheer
number of components, requiring more systemic countermeasures. Thus, the
ability to be fault aware, i.e. to detect and collect information about faults
and critical events, is a necessary feature that large-scale distributed
architectures must provide in order to apply systemic fault tolerance
techniques. In this context, the LO|FA|MO approach is a way to obtain systemic
fault awareness by implementing a mutual watchdog mechanism and guaranteeing
fault detection with no single point of failure. This document contains
specification and implementation details about this approach, in the form of a
technical report. |
DOI: | 10.48550/arxiv.1307.0433 |