MATCH: An MPI Fault Tolerance Benchmark Suite

IEEE International Symposium on Workload Characterization (IISWC 2020). MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI appl...

Bibliographic details
Main authors:	Guo, Luanzheng; Georgakoudis, Giorgis; Parasyris, Konstantinos; Laguna, Ignacio; Li, Dong
Format:	Article
Language:	eng
Subjects:	Computer Science - Distributed, Parallel, and Cluster Computing

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Guo, Luanzheng ; Georgakoudis, Giorgis ; Parasyris, Konstantinos ; Laguna, Ignacio ; Li, Dong
description	IEEE International Symposium on Workload Characterization (IISWC 2020). MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques has been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so as to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.
doi_str_mv	10.48550/arxiv.2102.06894
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2102_06894</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2102_06894</sourcerecordid><originalsourceid>FETCH-LOGICAL-a674-d4d3569354ac33f34d89778463c5587709da5c5a5304b392d4ec11d1c53da9323</originalsourceid><addsrcrecordid>eNotzkFuwjAQhWFvWCDoAVjhCyS1MzOxzS6gQpFARSL7aLCNiAihSqFqbw-Frt7qf_qEGGmVoiVSr9z91N9pplWWqtw67ItkXZSz94ksWrneLOWcr81Flucmdtz6KKex9YcTd0e5vdaXOBS9PTdf8eV_B6Kcv937ZPWxWM6KVcK5wSRgAModELIH2AMG64yxmIMnssYoF5g8MYHCHbgsYPRaB-0JAjvIYCDGz9uHt_rs6jvht_pzVw833ADMMjn6</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>MATCH: An MPI Fault Tolerance Benchmark Suite</title><source>arXiv.org</source><creator>Guo, Luanzheng ; Georgakoudis, Giorgis ; Parasyris, Konstantinos ; Laguna, Ignacio ; Li, Dong</creator><creatorcontrib>Guo, Luanzheng ; Georgakoudis, Giorgis ; Parasyris, Konstantinos ; Laguna, Ignacio ; Li, Dong</creatorcontrib><description>IEEE International Symposium on Workload Characterization (IISWC 2020) MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. 
To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.</description><identifier>DOI: 10.48550/arxiv.2102.06894</identifier><language>eng</language><subject>Computer Science - Distributed, Parallel, and Cluster Computing</subject><creationdate>2021-02</creationdate><rights>http://creativecommons.org/licenses/by-nc-nd/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2102.06894$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2102.06894$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Guo, Luanzheng</creatorcontrib><creatorcontrib>Georgakoudis, Giorgis</creatorcontrib><creatorcontrib>Parasyris, Konstantinos</creatorcontrib><creatorcontrib>Laguna, Ignacio</creatorcontrib><creatorcontrib>Li, Dong</creatorcontrib><title>MATCH: An MPI Fault Tolerance Benchmark Suite</title><description>IEEE International Symposium on Workload Characterization (IISWC 2020) MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute 
nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. 
MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.</description><subject>Computer Science - Distributed, Parallel, and Cluster Computing</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotzkFuwjAQhWFvWCDoAVjhCyS1MzOxzS6gQpFARSL7aLCNiAihSqFqbw-Frt7qf_qEGGmVoiVSr9z91N9pplWWqtw67ItkXZSz94ksWrneLOWcr81Flucmdtz6KKex9YcTd0e5vdaXOBS9PTdf8eV_B6Kcv937ZPWxWM6KVcK5wSRgAModELIH2AMG64yxmIMnssYoF5g8MYHCHbgsYPRaB-0JAjvIYCDGz9uHt_rs6jvht_pzVw833ADMMjn6</recordid><startdate>20210213</startdate><enddate>20210213</enddate><creator>Guo, Luanzheng</creator><creator>Georgakoudis, Giorgis</creator><creator>Parasyris, Konstantinos</creator><creator>Laguna, Ignacio</creator><creator>Li, Dong</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20210213</creationdate><title>MATCH: An MPI Fault Tolerance Benchmark Suite</title><author>Guo, Luanzheng ; Georgakoudis, Giorgis ; Parasyris, Konstantinos ; Laguna, Ignacio ; Li, Dong</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a674-d4d3569354ac33f34d89778463c5587709da5c5a5304b392d4ec11d1c53da9323</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Computer Science - Distributed, Parallel, and Cluster Computing</topic><toplevel>online_resources</toplevel><creatorcontrib>Guo, Luanzheng</creatorcontrib><creatorcontrib>Georgakoudis, Giorgis</creatorcontrib><creatorcontrib>Parasyris, Konstantinos</creatorcontrib><creatorcontrib>Laguna, Ignacio</creatorcontrib><creatorcontrib>Li, Dong</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Guo, Luanzheng</au><au>Georgakoudis, Giorgis</au><au>Parasyris, 
Konstantinos</au><au>Laguna, Ignacio</au><au>Li, Dong</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>MATCH: An MPI Fault Tolerance Benchmark Suite</atitle><date>2021-02-13</date><risdate>2021</risdate><abstract>IEEE International Symposium on Workload Characterization (IISWC 2020) MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate distributed scientific applications running on tens of hundreds of processes and compute nodes. Maintaining the correctness and integrity of MPI application execution is critical, especially for safety-critical scientific applications. Therefore, a collection of effective MPI fault tolerance techniques have been proposed to enable MPI application execution to efficiently resume from system failures. However, there is no structured way to study and compare different MPI fault tolerance designs, so to guide the selection and development of efficient MPI fault tolerance techniques for distinct scenarios. To solve this problem, we design, develop, and evaluate a benchmark suite called MATCH to characterize, research, and comprehensively compare different combinations and configurations of MPI fault tolerance designs. Our investigation derives useful findings: (1) Reinit recovery in general performs better than ULFM recovery; (2) Reinit recovery is independent of the scaling size and the input problem size, whereas ULFM recovery is not; (3) Using Reinit recovery with FTI checkpointing is a highly efficient fault tolerance design. MATCH code is available at https://github.com/kakulo/MPI-FT-Bench.</abstract><doi>10.48550/arxiv.2102.06894</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2102.06894
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2102_06894
source	arXiv.org
subjects	Computer Science - Distributed, Parallel, and Cluster Computing
title	MATCH: An MPI Fault Tolerance Benchmark Suite
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T06%3A04%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MATCH:%20An%20MPI%20Fault%20Tolerance%20Benchmark%20Suite&rft.au=Guo,%20Luanzheng&rft.date=2021-02-13&rft_id=info:doi/10.48550/arxiv.2102.06894&rft_dat=%3Carxiv_GOX%3E2102_06894%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true