MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol t...

Bibliographic Details
Main Authors: Bouteiller, Aurélien; Cappello, Franck; Herault, Thomas; Krawezik, Géraud; Lemarinier, Pierre; Magniette, Frédéric
Format: Conference Proceeding
Language: eng
Subjects:
Online Access: Order full text
container_end_page 25
container_issue
container_start_page 25
container_title
container_volume
creator Bouteiller, Aurélien
Cappello, Franck
Herault, Thomas
Krawezik, Géraud
Lemarinier, Pierre
Magniette, Frédéric
description Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.
doi_str_mv 10.1145/1048935.1050176
format Conference Proceeding
fullrecord <record><control><sourceid>proquest_6IE</sourceid><recordid>TN_cdi_ieee_primary_1592928</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>1592928</ieee_id><sourcerecordid>31294876</sourcerecordid><originalsourceid>FETCH-LOGICAL-a239t-42aa59c3ccc6aac659db4392ada2c5883ec239f63ce16e7d556c592af1273e513</originalsourceid><addsrcrecordid>eNqNkLtOxDAQRS0hpEVLagp6RJPg8Xj8KNGKx0qLoABay3EcKZCQJd4t-HuMkg9gminO0S0OYxfAKwBJN8ClsUgVcOKg1QkrrDZABgCVJVixIqUPnk8ioRZnbPX0st08lu_inJ22vk-xWP6avd3fvWa0e37Ybm53pRdoD6UU3pMNGEJQ3gdFtqklWuEbLwIZgzFkr1UYIqioGyIVKOMWhMZIgGt2Ne_up_H7GNPBDV0Kse_9VxyPySEIK41WWbycxS7G6PZTN_jpx0Ees8Jkej1THwZXj-NncsDdXwO3NHBLg6xW_1RdPXWxxV9F6VfS</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype><pqid>31294876</pqid></control><display><type>conference_proceeding</type><title>MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric</creator><creatorcontrib>Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric</creatorcontrib><description>Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. 
This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.</description><identifier>ISBN: 9781581136951</identifier><identifier>ISBN: 1581136951</identifier><identifier>DOI: 10.1145/1048935.1050176</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Checkpointing ; Clocks ; Computer systems organization ; Computer systems organization -- Architectures ; Computer systems organization -- Architectures -- Distributed architectures ; Costs ; Fault tolerance ; General and reference ; High performance computing ; Message passing ; Permission ; Programming profession ; Protocols ; Software and its engineering ; Software and its engineering -- Software organization and properties ; Software and its engineering -- Software organization and properties -- Extra-functional properties ; Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance ; Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart ; Software and its engineering -- Software organization and properties -- Software system structures ; Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles ; Uniform resource locators</subject><ispartof>ACM/IEEE SC 2003 Conference (SC'03), 2003, p.25-25</ispartof><rights>2003 
ACM</rights><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a239t-42aa59c3ccc6aac659db4392ada2c5883ec239f63ce16e7d556c592af1273e513</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/1592928$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,4050,4051,27925,54920</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/1592928$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Bouteiller, Aurélien</creatorcontrib><creatorcontrib>Cappello, Franck</creatorcontrib><creatorcontrib>Herault, Thomas</creatorcontrib><creatorcontrib>Krawezik, Géraud</creatorcontrib><creatorcontrib>Lemarinier, Pierre</creatorcontrib><creatorcontrib>Magniette, Frédéric</creatorcontrib><title>MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging</title><title>ACM/IEEE SC 2003 Conference (SC'03)</title><addtitle>SUPERC</addtitle><description>Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. 
We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.</description><subject>Checkpointing</subject><subject>Clocks</subject><subject>Computer systems organization</subject><subject>Computer systems organization -- Architectures</subject><subject>Computer systems organization -- Architectures -- Distributed architectures</subject><subject>Costs</subject><subject>Fault tolerance</subject><subject>General and reference</subject><subject>High performance computing</subject><subject>Message passing</subject><subject>Permission</subject><subject>Programming profession</subject><subject>Protocols</subject><subject>Software and its engineering</subject><subject>Software and its engineering -- Software organization and properties</subject><subject>Software and its engineering -- Software organization and properties -- Extra-functional properties</subject><subject>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance</subject><subject>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart</subject><subject>Software and its engineering -- Software organization and properties -- Software system structures</subject><subject>Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles</subject><subject>Uniform resource 
locators</subject><isbn>9781581136951</isbn><isbn>1581136951</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2003</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNqNkLtOxDAQRS0hpEVLagp6RJPg8Xj8KNGKx0qLoABay3EcKZCQJd4t-HuMkg9gminO0S0OYxfAKwBJN8ClsUgVcOKg1QkrrDZABgCVJVixIqUPnk8ioRZnbPX0st08lu_inJ22vk-xWP6avd3fvWa0e37Ybm53pRdoD6UU3pMNGEJQ3gdFtqklWuEbLwIZgzFkr1UYIqioGyIVKOMWhMZIgGt2Ne_up_H7GNPBDV0Kse_9VxyPySEIK41WWbycxS7G6PZTN_jpx0Ees8Jkej1THwZXj-NncsDdXwO3NHBLg6xW_1RdPXWxxV9F6VfS</recordid><startdate>20031115</startdate><enddate>20031115</enddate><creator>Bouteiller, Aurélien</creator><creator>Cappello, Franck</creator><creator>Herault, Thomas</creator><creator>Krawezik, Géraud</creator><creator>Lemarinier, Pierre</creator><creator>Magniette, Frédéric</creator><general>ACM</general><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20031115</creationdate><title>MPICH-V2</title><author>Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a239t-42aa59c3ccc6aac659db4392ada2c5883ec239f63ce16e7d556c592af1273e513</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2003</creationdate><topic>Checkpointing</topic><topic>Clocks</topic><topic>Computer systems organization</topic><topic>Computer systems organization -- Architectures</topic><topic>Computer systems organization -- Architectures -- Distributed architectures</topic><topic>Costs</topic><topic>Fault tolerance</topic><topic>General and reference</topic><topic>High performance 
computing</topic><topic>Message passing</topic><topic>Permission</topic><topic>Programming profession</topic><topic>Protocols</topic><topic>Software and its engineering</topic><topic>Software and its engineering -- Software organization and properties</topic><topic>Software and its engineering -- Software organization and properties -- Extra-functional properties</topic><topic>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance</topic><topic>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart</topic><topic>Software and its engineering -- Software organization and properties -- Software system structures</topic><topic>Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles</topic><topic>Uniform resource locators</topic><toplevel>online_resources</toplevel><creatorcontrib>Bouteiller, Aurélien</creatorcontrib><creatorcontrib>Cappello, Franck</creatorcontrib><creatorcontrib>Herault, Thomas</creatorcontrib><creatorcontrib>Krawezik, Géraud</creatorcontrib><creatorcontrib>Lemarinier, Pierre</creatorcontrib><creatorcontrib>Magniette, Frédéric</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – 
Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Bouteiller, Aurélien</au><au>Cappello, Franck</au><au>Herault, Thomas</au><au>Krawezik, Géraud</au><au>Lemarinier, Pierre</au><au>Magniette, Frédéric</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging</atitle><btitle>ACM/IEEE SC 2003 Conference (SC'03)</btitle><stitle>SUPERC</stitle><date>2003-11-15</date><risdate>2003</risdate><spage>25</spage><epage>25</epage><pages>25-25</pages><isbn>9781581136951</isbn><isbn>1581136951</isbn><abstract>Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. 
Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/1048935.1050176</doi><tpages>1</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISBN: 9781581136951
ispartof ACM/IEEE SC 2003 Conference (SC'03), 2003, p.25-25
issn
language eng
recordid cdi_ieee_primary_1592928
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Checkpointing
Clocks
Computer systems organization
Computer systems organization -- Architectures
Computer systems organization -- Architectures -- Distributed architectures
Costs
Fault tolerance
General and reference
High performance computing
Message passing
Permission
Programming profession
Protocols
Software and its engineering
Software and its engineering -- Software organization and properties
Software and its engineering -- Software organization and properties -- Extra-functional properties
Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance
Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart
Software and its engineering -- Software organization and properties -- Software system structures
Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles
Uniform resource locators
title MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T16%3A02%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=MPICH-V2:%20a%20Fault%20Tolerant%20MPI%20for%20Volatile%20Nodes%20based%20on%20Pessimistic%20Sender%20Based%20Message%20Logging&rft.btitle=ACM/IEEE%20SC%202003%20Conference%20(SC'03)&rft.au=Bouteiller,%20Aur%C3%A9lien&rft.date=2003-11-15&rft.spage=25&rft.epage=25&rft.pages=25-25&rft.isbn=9781581136951&rft.isbn_list=1581136951&rft_id=info:doi/10.1145/1048935.1050176&rft_dat=%3Cproquest_6IE%3E31294876%3C/proquest_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=31294876&rft_id=info:pmid/&rft_ieee_id=1592928&rfr_iscdi=true