MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol t...
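Main authors: | Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric |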
---|---|
Format: | Conference proceedings |
Language: | eng |
Subjects: |
Software and its engineering
> Software organization and properties
> Extra-functional properties
> Software fault tolerance
Software and its engineering
> Software organization and properties
> Extra-functional properties
> Software fault tolerance
> Checkpoint
> restart
|
Online access: | Order full text |
container_end_page | 25 |
---|---|
container_issue | |
container_start_page | 25 |
container_title | |
container_volume | |
creator | Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric |
description | Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1. |
doi_str_mv | 10.1145/1048935.1050176 |
format | Conference Proceeding |
fullrecord | <record><control><sourceid>proquest_6IE</sourceid><recordid>TN_cdi_ieee_primary_1592928</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>1592928</ieee_id><sourcerecordid>31294876</sourcerecordid><originalsourceid>FETCH-LOGICAL-a239t-42aa59c3ccc6aac659db4392ada2c5883ec239f63ce16e7d556c592af1273e513</originalsourceid><addsrcrecordid>eNqNkLtOxDAQRS0hpEVLagp6RJPg8Xj8KNGKx0qLoABay3EcKZCQJd4t-HuMkg9gminO0S0OYxfAKwBJN8ClsUgVcOKg1QkrrDZABgCVJVixIqUPnk8ioRZnbPX0st08lu_inJ22vk-xWP6avd3fvWa0e37Ybm53pRdoD6UU3pMNGEJQ3gdFtqklWuEbLwIZgzFkr1UYIqioGyIVKOMWhMZIgGt2Ne_up_H7GNPBDV0Kse_9VxyPySEIK41WWbycxS7G6PZTN_jpx0Ees8Jkej1THwZXj-NncsDdXwO3NHBLg6xW_1RdPXWxxV9F6VfS</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype><pqid>31294876</pqid></control><display><type>conference_proceeding</type><title>MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric</creator><creatorcontrib>Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric</creatorcontrib><description>Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. 
This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.</description><identifier>ISBN: 9781581136951</identifier><identifier>ISBN: 1581136951</identifier><identifier>DOI: 10.1145/1048935.1050176</identifier><language>eng</language><publisher>New York, NY, USA: ACM</publisher><subject>Checkpointing ; Clocks ; Computer systems organization ; Computer systems organization -- Architectures ; Computer systems organization -- Architectures -- Distributed architectures ; Costs ; Fault tolerance ; General and reference ; High performance computing ; Message passing ; Permission ; Programming profession ; Protocols ; Software and its engineering ; Software and its engineering -- Software organization and properties ; Software and its engineering -- Software organization and properties -- Extra-functional properties ; Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance ; Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart ; Software and its engineering -- Software organization and properties -- Software system structures ; Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles ; Uniform resource locators</subject><ispartof>ACM/IEEE SC 2003 Conference (SC'03), 2003, p.25-25</ispartof><rights>2003 
ACM</rights><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a239t-42aa59c3ccc6aac659db4392ada2c5883ec239f63ce16e7d556c592af1273e513</citedby></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/1592928$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,4050,4051,27925,54920</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/1592928$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Bouteiller, Aurélien</creatorcontrib><creatorcontrib>Cappello, Franck</creatorcontrib><creatorcontrib>Herault, Thomas</creatorcontrib><creatorcontrib>Krawezik, Géraud</creatorcontrib><creatorcontrib>Lemarinier, Pierre</creatorcontrib><creatorcontrib>Magniette, Frédéric</creatorcontrib><title>MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging</title><title>ACM/IEEE SC 2003 Conference (SC'03)</title><addtitle>SUPERC</addtitle><description>Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. 
We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.</description><subject>Checkpointing</subject><subject>Clocks</subject><subject>Computer systems organization</subject><subject>Computer systems organization -- Architectures</subject><subject>Computer systems organization -- Architectures -- Distributed architectures</subject><subject>Costs</subject><subject>Fault tolerance</subject><subject>General and reference</subject><subject>High performance computing</subject><subject>Message passing</subject><subject>Permission</subject><subject>Programming profession</subject><subject>Protocols</subject><subject>Software and its engineering</subject><subject>Software and its engineering -- Software organization and properties</subject><subject>Software and its engineering -- Software organization and properties -- Extra-functional properties</subject><subject>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance</subject><subject>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart</subject><subject>Software and its engineering -- Software organization and properties -- Software system structures</subject><subject>Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles</subject><subject>Uniform resource 
locators</subject><isbn>9781581136951</isbn><isbn>1581136951</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2003</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNqNkLtOxDAQRS0hpEVLagp6RJPg8Xj8KNGKx0qLoABay3EcKZCQJd4t-HuMkg9gminO0S0OYxfAKwBJN8ClsUgVcOKg1QkrrDZABgCVJVixIqUPnk8ioRZnbPX0st08lu_inJ22vk-xWP6avd3fvWa0e37Ybm53pRdoD6UU3pMNGEJQ3gdFtqklWuEbLwIZgzFkr1UYIqioGyIVKOMWhMZIgGt2Ne_up_H7GNPBDV0Kse_9VxyPySEIK41WWbycxS7G6PZTN_jpx0Ees8Jkej1THwZXj-NncsDdXwO3NHBLg6xW_1RdPXWxxV9F6VfS</recordid><startdate>20031115</startdate><enddate>20031115</enddate><creator>Bouteiller, Aurélien</creator><creator>Cappello, Franck</creator><creator>Herault, Thomas</creator><creator>Krawezik, Géraud</creator><creator>Lemarinier, Pierre</creator><creator>Magniette, Frédéric</creator><general>ACM</general><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20031115</creationdate><title>MPICH-V2</title><author>Bouteiller, Aurélien ; Cappello, Franck ; Herault, Thomas ; Krawezik, Géraud ; Lemarinier, Pierre ; Magniette, Frédéric</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a239t-42aa59c3ccc6aac659db4392ada2c5883ec239f63ce16e7d556c592af1273e513</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2003</creationdate><topic>Checkpointing</topic><topic>Clocks</topic><topic>Computer systems organization</topic><topic>Computer systems organization -- Architectures</topic><topic>Computer systems organization -- Architectures -- Distributed architectures</topic><topic>Costs</topic><topic>Fault tolerance</topic><topic>General and reference</topic><topic>High performance 
computing</topic><topic>Message passing</topic><topic>Permission</topic><topic>Programming profession</topic><topic>Protocols</topic><topic>Software and its engineering</topic><topic>Software and its engineering -- Software organization and properties</topic><topic>Software and its engineering -- Software organization and properties -- Extra-functional properties</topic><topic>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance</topic><topic>Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart</topic><topic>Software and its engineering -- Software organization and properties -- Software system structures</topic><topic>Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles</topic><topic>Uniform resource locators</topic><toplevel>online_resources</toplevel><creatorcontrib>Bouteiller, Aurélien</creatorcontrib><creatorcontrib>Cappello, Franck</creatorcontrib><creatorcontrib>Herault, Thomas</creatorcontrib><creatorcontrib>Krawezik, Géraud</creatorcontrib><creatorcontrib>Lemarinier, Pierre</creatorcontrib><creatorcontrib>Magniette, Frédéric</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Xplore</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts 
Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Bouteiller, Aurélien</au><au>Cappello, Franck</au><au>Herault, Thomas</au><au>Krawezik, Géraud</au><au>Lemarinier, Pierre</au><au>Magniette, Frédéric</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging</atitle><btitle>ACM/IEEE SC 2003 Conference (SC'03)</btitle><stitle>SUPERC</stitle><date>2003-11-15</date><risdate>2003</risdate><spage>25</spage><epage>25</epage><pages>25-25</pages><isbn>9781581136951</isbn><isbn>1581136951</isbn><abstract>Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4 evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, c) the application performance when many faults occur during the execution. 
Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while reducing dramatically the number of reliable nodes compared to MPICH-V1.</abstract><cop>New York, NY, USA</cop><pub>ACM</pub><doi>10.1145/1048935.1050176</doi><tpages>1</tpages></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISBN: 9781581136951 |
ispartof | ACM/IEEE SC 2003 Conference (SC'03), 2003, p.25-25 |
issn | |
language | eng |
recordid | cdi_ieee_primary_1592928 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Checkpointing ; Clocks ; Computer systems organization ; Computer systems organization -- Architectures ; Computer systems organization -- Architectures -- Distributed architectures ; Costs ; Fault tolerance ; General and reference ; High performance computing ; Message passing ; Permission ; Programming profession ; Protocols ; Software and its engineering ; Software and its engineering -- Software organization and properties ; Software and its engineering -- Software organization and properties -- Extra-functional properties ; Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance ; Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart ; Software and its engineering -- Software organization and properties -- Software system structures ; Software and its engineering -- Software organization and properties -- Software system structures -- Distributed systems organizing principles ; Uniform resource locators |
title | MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T16%3A02%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=MPICH-V2:%20a%20Fault%20Tolerant%20MPI%20for%20Volatile%20Nodes%20based%20on%20Pessimistic%20Sender%20Based%20Message%20Logging&rft.btitle=ACM/IEEE%20SC%202003%20Conference%20(SC'03)&rft.au=Bouteiller,%20Aur%C3%A9lien&rft.date=2003-11-15&rft.spage=25&rft.epage=25&rft.pages=25-25&rft.isbn=9781581136951&rft.isbn_list=1581136951&rft_id=info:doi/10.1145/1048935.1050176&rft_dat=%3Cproquest_6IE%3E31294876%3C/proquest_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=31294876&rft_id=info:pmid/&rft_ieee_id=1592928&rfr_iscdi=true |