Coordinated checkpoint versus message log for fault tolerant MPI

MPI is one of the most widely adopted programming models for large clusters and grid deployments. However, these systems often suffer from network or node failures, which raises the issue of selecting a fault tolerance approach for MPI. Automatic and transparent approaches are based on either coordinated checkpointing or message logging combined with uncoordinated checkpointing. There are many protocols, implementations and optimizations for these approaches, but few results comparing them. Coordinated checkpointing has the advantage of a very low overhead on fault-free executions; in contrast, a message logging protocol systematically adds a significant message transfer penalty. The drawback of coordinated checkpointing is its synchronization cost at checkpoint and restart times. In this paper we implement, evaluate and compare the two kinds of protocols, with a special emphasis on their respective performance as a function of fault frequency. The main conclusion (under our experimental conditions) is that message logging becomes relevant for a large-scale cluster from one fault every hour onward, for applications with large datasets.
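
The trade-off described in the abstract can be made concrete with a toy cost model. The sketch below is not the paper's methodology and uses none of its measured numbers; every function, parameter, and value (checkpoint cost, logging overhead, restart and replay costs, mean time between faults) is a hypothetical assumption, chosen only to show how a crossover between the two approaches can appear as faults become more frequent.

```python
# Toy back-of-envelope model of coordinated checkpointing vs. message logging.
# All formulas and numbers are illustrative assumptions, not the paper's model.

def expected_runtime_coordinated(t_work_h, mtbf_h, checkpoint_cost_h,
                                 restart_cost_h, checkpoint_interval_h):
    """Rough expected runtime (hours) with coordinated checkpointing.

    Assumption: every checkpoint pays a global synchronization cost, and every
    fault rolls ALL processes back to the last coordinated checkpoint, losing
    on average half a checkpoint interval of work plus a restart cost.
    """
    n_checkpoints = t_work_h / checkpoint_interval_h
    fault_free = t_work_h + n_checkpoints * checkpoint_cost_h
    n_faults = fault_free / mtbf_h
    per_fault_loss = restart_cost_h + checkpoint_interval_h / 2
    return fault_free + n_faults * per_fault_loss


def expected_runtime_message_logging(t_work_h, mtbf_h, logging_overhead,
                                     restart_cost_h, replay_cost_h):
    """Rough expected runtime (hours) with message logging plus uncoordinated checkpointing.

    Assumption: a constant relative slowdown on message transfers
    (logging_overhead), but only the failed process restarts and replays
    its logged messages, so the per-fault cost stays small.
    """
    fault_free = t_work_h * (1 + logging_overhead)
    n_faults = fault_free / mtbf_h
    return fault_free + n_faults * (restart_cost_h + replay_cost_h)


if __name__ == "__main__":
    # Sweep the mean time between faults (hours); all costs below are made up.
    for mtbf_h in (8.0, 4.0, 2.0, 1.0, 0.5):
        coord = expected_runtime_coordinated(
            t_work_h=10.0, mtbf_h=mtbf_h, checkpoint_cost_h=0.05,
            restart_cost_h=0.2, checkpoint_interval_h=1.0)
        log = expected_runtime_message_logging(
            t_work_h=10.0, mtbf_h=mtbf_h, logging_overhead=0.2,
            restart_cost_h=0.05, replay_cost_h=0.1)
        print(f"MTBF {mtbf_h:4.1f} h  coordinated {coord:6.2f} h  message logging {log:6.2f} h")
```

With these made-up parameters the coordinated-checkpoint estimate is lower when faults are rare and the message-logging estimate takes over as the mean time between faults shrinks; where exactly the crossover falls depends entirely on the assumed costs, which is the question the paper answers experimentally on a real cluster.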

Bibliographic Details
Main Authors: Bouteiller, Lemarinier, Krawezik, Capello
Format: Conference Proceeding
Language: English
Subjects: Computer fault tolerance; Message passing; System recovery
Online Access: Order full text
Pages: 242-250
Creators: Bouteiller; Lemarinier; Krawezik; Capello
DOI: 10.1109/CLUSTR.2003.1253321
Format: Conference Proceeding
ISBN: 9780769520667
Part of: 2003 Proceedings IEEE International Conference on Cluster Computing, 2003, p. 242-250
Language: English
Source: IEEE Electronic Library (IEL) Conference Proceedings
Subjects: Computer fault tolerance; Message passing; System recovery