Resiliency in Numerical Algorithm Design for Extreme Scale Simulations
Main authors: | , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | This work is based on the seminar titled "Resiliency in Numerical Algorithm
Design for Extreme Scale Simulations", held March 1-6, 2020, at Schloss
Dagstuhl, which was attended by all the authors.
Naive versions of conventional resilience techniques will not scale to the
exascale regime: with a main memory footprint of tens of petabytes,
synchronously writing checkpoint data all the way to background storage at
frequent intervals will create intolerable overheads in runtime and energy
consumption. Forecasts show that the mean time between failures could be lower
than the time to recover from such a checkpoint, so that large calculations at
scale might not make any progress if robust alternatives are not investigated.
More advanced resilience techniques must be devised. The key may lie in
exploiting both advanced system features and specific application
knowledge. Research will face two essential questions: (1) what are the
reliability requirements for a particular computation and (2) how do we best
design the algorithms and software to meet these requirements? One avenue would
be to refine and improve on system- or application-level checkpointing and
rollback strategies for when an error is detected. Developers might use
fault notification interfaces and flexible runtime systems to respond to node
failures in an application-dependent fashion. Novel numerical algorithms or
more stochastic computational approaches may be required to meet accuracy
requirements in the face of undetectable soft errors.
The goal of this Dagstuhl Seminar was to bring together a diverse group of
scientists with expertise in exascale computing to discuss novel ways to make
applications resilient against detected and undetected faults. In particular,
participants explored the role that algorithms and applications play in the
holistic approach needed to tackle this challenge. |
---|---|
DOI: | 10.48550/arxiv.2010.13342 |