MATCH: An MPI Fault Tolerance Benchmark Suite
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Summary: | IEEE International Symposium on Workload Characterization (IISWC
2020). MPI has been ubiquitously deployed in flagship HPC systems aiming to
accelerate distributed scientific applications running on tens of hundreds of
processes and compute nodes. Maintaining the correctness and integrity of MPI
application execution is critical, especially for safety-critical scientific
applications. Therefore, a collection of effective MPI fault tolerance
techniques has been proposed to enable MPI application execution to
efficiently resume from system failures. However, there is no structured way to
study and compare different MPI fault tolerance designs so as to guide the
selection and development of efficient MPI fault tolerance techniques for
distinct scenarios. To solve this problem, we design, develop, and evaluate a
benchmark suite called MATCH to characterize, research, and comprehensively
compare different combinations and configurations of MPI fault tolerance
designs. Our investigation yields three findings: (1) Reinit recovery in
general performs better than ULFM recovery; (2) Reinit recovery is independent
of the scaling size and the input problem size, whereas ULFM recovery is not;
(3) using Reinit recovery with FTI checkpointing is a highly efficient fault
tolerance design. The MATCH code is available at
https://github.com/kakulo/MPI-FT-Bench. |
DOI: | 10.48550/arxiv.2102.06894 |