Energy-efficient localised rollback after failures via data flow analysis
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Exascale systems will suffer failures hourly. HPC programmers rely mostly on
application-level checkpointing and a global rollback to recover. In recent years,
techniques that reduce the number of processes rolling back have been implemented
via message logging. However, log-based approaches have weaknesses, such as
depending on complex modifications within an MPI implementation, and the
fact that a full restart may be required in the general case. To address the
limitations of all log-based mechanisms, we return to checkpoint-only
mechanisms, but advocate data-flow-driven recovery (DFR), a fundamentally
different approach relying on analysis of the data flow of iterative codes and
the well-known concept of data-flow graphs. We demonstrate the effectiveness of
DFR for an MPI stencil code in optimising rollback and reducing the overall energy
consumption by 10-12% on idling nodes during localised rollback. We also
provide large-scale estimates of the energy savings of DFR compared to global
rollback, which for stencil codes grow as n squared for a process count n. |
---|---|
DOI: | 10.48550/arxiv.1806.01611 |