Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions
We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of per...
Published in: | The international journal of high performance computing applications 2019-03, Vol.33 (2), p.366-383 |
---|---|
Main authors: | Casas, Marc; Gansterer, Wilfried N; Wimmer, Elias |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 383 |
---|---|
container_issue | 2 |
container_start_page | 366 |
container_title | The international journal of high performance computing applications |
container_volume | 33 |
creator | Casas, Marc; Gansterer, Wilfried N; Wimmer, Elias |
description | We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements. |
doi_str_mv | 10.1177/1094342018762531 |
format | Article |
fullrecord | <record><control><sourceid>proquest_csuc_</sourceid><recordid>TN_cdi_csuc_recercat_oai_recercat_cat_2072_330700</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1177_1094342018762531</sage_id><sourcerecordid>2180446252</sourcerecordid><originalsourceid>FETCH-LOGICAL-c393t-4189cf817b445de16418f599e9c1b8c41a3c134ba08af885e29438823271378d3</originalsourceid><addsrcrecordid>eNp1UU1LxDAQLaLgunr3GPC60UySblJvsvgFgiJ6Dtk07WZpm5q0B_31Zj9gQfAQ5mXevMfwJssugVwDCHEDpOCMUwJSzGnO4CibgOCAqeTz44QTjTf8aXYW45oQMucsn2Q_7za6xtluQLWP0fXYdbF3wZZINw1OdTQ2wdoHN6zaiCof0MrVK9zbkHCru8Qb3_bj4Lr6Fr35IZk53cxQ41o36MH5Ls6Q7krke9uhr9HGbe88O6l0E-3Fvk6zz4f7j8UTfnl9fF7cvWDDCjZgDrIwlQSx5DwvLcxTo8qLwhYGltJw0MwA40tNpK6kzC1NQUhJGRXAhCzZNIOdr4mjUcEaG4welNfu8Nk8SgRVjBFBSNJc7TR98NuF1dqPoUtrKgqScJ4ypmmK7J1Dyi7YSvXBtTp8KyBqcxX19ypJgneSqGt7MP13_hcVjoza</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2180446252</pqid></control><display><type>article</type><title>Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions</title><source>SAGE Complete A-Z List</source><source>Recercat</source><source>Alma/SFX Local Collection</source><creator>Casas, Marc ; Gansterer, Wilfried N ; Wimmer, Elias</creator><creatorcontrib>Casas, Marc ; Gansterer, Wilfried N ; Wimmer, Elias</creatorcontrib><description>We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. 
Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.</description><identifier>ISSN: 1094-3420</identifier><identifier>EISSN: 1741-2846</identifier><identifier>DOI: 10.1177/1094342018762531</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>Algorithms ; All-reduce ; All-to-all reduction ; Bit-flip ; Computation ; Fault tolerance ; Gossip ; Gossip algorithm ; High performance computing ; Informàtica ; Push-flow agorithm ; Recursive doubling ; Reduction ; Resilience ; Servers ; Silent data corruption ; State of the art ; Supercomputadors ; Àrees temàtiques de la UPC</subject><ispartof>The international journal of high performance computing applications, 2019-03, Vol.33 (2), p.366-383</ispartof><rights>The Author(s) 2018</rights><rights>info:eu-repo/semantics/openAccess</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c393t-4189cf817b445de16418f599e9c1b8c41a3c134ba08af885e29438823271378d3</citedby><cites>FETCH-LOGICAL-c393t-4189cf817b445de16418f599e9c1b8c41a3c134ba08af885e29438823271378d3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://journals.sagepub.com/doi/pdf/10.1177/1094342018762531$$EPDF$$P50$$Gsage$$H</linktopdf><linktohtml>$$Uhttps://journals.sagepub.com/doi/10.1177/1094342018762531$$EHTML$$P50$$Gsage$$H</linktohtml><link.rule.ids>230,314,776,780,881,21799,26953,27903,27904,43600,43601</link.rule.ids></links><search><creatorcontrib>Casas, Marc</creatorcontrib><creatorcontrib>Gansterer, Wilfried N</creatorcontrib><creatorcontrib>Wimmer, Elias</creatorcontrib><title>Resilient gossip-inspired all-reduce 
algorithms for high-performance computing: Potential, limitations, and open questions</title><title>The international journal of high performance computing applications</title><description>We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.</description><subject>Algorithms</subject><subject>All-reduce</subject><subject>All-to-all reduction</subject><subject>Bit-flip</subject><subject>Computation</subject><subject>Fault tolerance</subject><subject>Gossip</subject><subject>Gossip algorithm</subject><subject>High performance computing</subject><subject>Informàtica</subject><subject>Push-flow agorithm</subject><subject>Recursive doubling</subject><subject>Reduction</subject><subject>Resilience</subject><subject>Servers</subject><subject>Silent data corruption</subject><subject>State of the art</subject><subject>Supercomputadors</subject><subject>Àrees temàtiques de la 
UPC</subject><issn>1094-3420</issn><issn>1741-2846</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><sourceid>XX2</sourceid><recordid>eNp1UU1LxDAQLaLgunr3GPC60UySblJvsvgFgiJ6Dtk07WZpm5q0B_31Zj9gQfAQ5mXevMfwJssugVwDCHEDpOCMUwJSzGnO4CibgOCAqeTz44QTjTf8aXYW45oQMucsn2Q_7za6xtluQLWP0fXYdbF3wZZINw1OdTQ2wdoHN6zaiCof0MrVK9zbkHCru8Qb3_bj4Lr6Fr35IZk53cxQ41o36MH5Ls6Q7krke9uhr9HGbe88O6l0E-3Fvk6zz4f7j8UTfnl9fF7cvWDDCjZgDrIwlQSx5DwvLcxTo8qLwhYGltJw0MwA40tNpK6kzC1NQUhJGRXAhCzZNIOdr4mjUcEaG4welNfu8Nk8SgRVjBFBSNJc7TR98NuF1dqPoUtrKgqScJ4ypmmK7J1Dyi7YSvXBtTp8KyBqcxX19ypJgneSqGt7MP13_hcVjoza</recordid><startdate>20190301</startdate><enddate>20190301</enddate><creator>Casas, Marc</creator><creator>Gansterer, Wilfried N</creator><creator>Wimmer, Elias</creator><general>SAGE Publications</general><general>SAGE PUBLICATIONS, INC</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>XX2</scope></search><sort><creationdate>20190301</creationdate><title>Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions</title><author>Casas, Marc ; Gansterer, Wilfried N ; Wimmer, Elias</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c393t-4189cf817b445de16418f599e9c1b8c41a3c134ba08af885e29438823271378d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Algorithms</topic><topic>All-reduce</topic><topic>All-to-all reduction</topic><topic>Bit-flip</topic><topic>Computation</topic><topic>Fault tolerance</topic><topic>Gossip</topic><topic>Gossip algorithm</topic><topic>High performance computing</topic><topic>Informàtica</topic><topic>Push-flow agorithm</topic><topic>Recursive 
doubling</topic><topic>Reduction</topic><topic>Resilience</topic><topic>Servers</topic><topic>Silent data corruption</topic><topic>State of the art</topic><topic>Supercomputadors</topic><topic>Àrees temàtiques de la UPC</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Casas, Marc</creatorcontrib><creatorcontrib>Gansterer, Wilfried N</creatorcontrib><creatorcontrib>Wimmer, Elias</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Recercat</collection><jtitle>The international journal of high performance computing applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Casas, Marc</au><au>Gansterer, Wilfried N</au><au>Wimmer, Elias</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions</atitle><jtitle>The international journal of high performance computing applications</jtitle><date>2019-03-01</date><risdate>2019</risdate><volume>33</volume><issue>2</issue><spage>366</spage><epage>383</epage><pages>366-383</pages><issn>1094-3420</issn><eissn>1741-2846</eissn><abstract>We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. 
We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1177/1094342018762531</doi><tpages>18</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1094-3420 |
ispartof | The international journal of high performance computing applications, 2019-03, Vol.33 (2), p.366-383 |
issn | 1094-3420; 1741-2846 |
language | eng |
recordid | cdi_csuc_recercat_oai_recercat_cat_2072_330700 |
source | SAGE Complete A-Z List; Recercat; Alma/SFX Local Collection |
subjects | Algorithms; All-reduce; All-to-all reduction; Bit-flip; Computation; Fault tolerance; Gossip; Gossip algorithm; High performance computing; Informàtica; Push-flow algorithm; Recursive doubling; Reduction; Resilience; Servers; Silent data corruption; State of the art; Supercomputadors; Àrees temàtiques de la UPC |
title | Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T02%3A50%3A21IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_csuc_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Resilient%20gossip-inspired%20all-reduce%20algorithms%20for%20high-performance%20computing:%20Potential,%20limitations,%20and%20open%20questions&rft.jtitle=The%20international%20journal%20of%20high%20performance%20computing%20applications&rft.au=Casas,%20Marc&rft.date=2019-03-01&rft.volume=33&rft.issue=2&rft.spage=366&rft.epage=383&rft.pages=366-383&rft.issn=1094-3420&rft.eissn=1741-2846&rft_id=info:doi/10.1177/1094342018762531&rft_dat=%3Cproquest_csuc_%3E2180446252%3C/proquest_csuc_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2180446252&rft_id=info:pmid/&rft_sage_id=10.1177_1094342018762531&rfr_iscdi=true |