Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures

It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best a...

Bibliographic Details
Published in: IEEE transactions on parallel and distributed systems 2008-08, Vol.19 (8), p.1044-1056
Main authors: Fernandez-Pascual, R., Garcia, J.M., Acacio, M.E., Duato, J.
Format: Article
Language: English
container_end_page 1056
container_issue 8
container_start_page 1044
container_title IEEE transactions on parallel and distributed systems
container_volume 19
creator Fernandez-Pascual, R.
Garcia, J.M.
Acacio, M.E.
Duato, J.
description It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors, such as the increased integration scale. At the same time, chip multiprocessors (CMPs), which integrate several processor cores in a single chip, are currently the best way to make more efficient use of the increasing number of transistors that can be placed on a single die. Hence, new techniques are needed to deal with these faults in order to build sufficiently reliable CMPs. In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using the GEMS full-system simulator, we compare our proposal against TokenCMP. We show that, in the absence of failures, our proposal does not increase execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time by more than 15 percent.
doi_str_mv 10.1109/TPDS.2007.70803
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_912282636</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>4385719</ieee_id><sourcerecordid>1671231065</sourcerecordid><originalsourceid>FETCH-LOGICAL-c384t-dd241b05643b89129074fe87db49fc7efa01deb32235235ca2549dda0e7d07013</originalsourceid><addsrcrecordid>eNp90U1v1DAQBuAIgUQpnDlwsTgAl2zHX7F9rEILSFt1pS5ny2tP2JQ0LnbSj3-PwyIOHCpZsi09M6PRW1VvKawoBXOy3Xy-WjEAtVKggT-rjqiUumZU8-flDULWhlHzsnqV8zUAFRLEUYVnDxOOoR9_kGmPZBt_4thebEjrfPm2cY8JR49kk-IUfRxIFxNZx3tyeYdpjy6QczcPUykcMLlF9iNZGpwmv-8n9NOcML-uXnRuyPjm731cfT8_27Zf6_Xll2_t6br2XIupDoEJugPZCL7ThjIDSnSoVdgJ03mFnQMacMcZ47Ic75gUJgQHqAIooPy4-njoe5virxnzZG_67HEY3IhxztYAb4TiwhT54UnJhWh0mV_gpychbRRlnEIjC33_H72OcxrLwrYswzRreFPQyQH5FHNO2Nnb1N-49Ggp2CVIuwRplyDtnyBLxbtDRY-I_7TgWipq-G9mvJds</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>912282636</pqid></control><display><type>article</type><title>Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures</title><source>IEEE Electronic Library (IEL)</source><creator>Fernandez-Pascual, R. ; Garcia, J.M. ; Acacio, M.E. ; Duato, J.</creator><creatorcontrib>Fernandez-Pascual, R. ; Garcia, J.M. ; Acacio, M.E. ; Duato, J.</creatorcontrib><description>It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable chip multiprocessors (CMPs). 
In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using GEMS full system simulator, we compare our proposal against TokenCMP. We show that in absence of failures our proposal does not introduce overhead in terms of increased execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15 percent.</description><identifier>ISSN: 1045-9219</identifier><identifier>EISSN: 1558-2183</identifier><identifier>DOI: 10.1109/TPDS.2007.70803</identifier><identifier>CODEN: ITDSEO</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>and Fault-Tolerance ; BShared memory ; Chemical-mechanical polishing ; Chips ; Coherence ; Construction ; Electromagnetic interference ; Electromagnetic radiation ; Electromagnetic transients ; Electronic components ; Energy consumption ; Failure ; Fault tolerance ; Integrated circuits ; Messages ; Microprocessors ; Multi-core/single-chip multiprocessors ; Multiprocessor interconnection networks ; Networks ; Proposals ; Protocols ; Reliability ; System recovery ; Testing</subject><ispartof>IEEE transactions on parallel and distributed systems, 2008-08, Vol.19 (8), p.1044-1056</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2008</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c384t-dd241b05643b89129074fe87db49fc7efa01deb32235235ca2549dda0e7d07013</citedby><cites>FETCH-LOGICAL-c384t-dd241b05643b89129074fe87db49fc7efa01deb32235235ca2549dda0e7d07013</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/4385719$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27924,27925,54758</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/4385719$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Fernandez-Pascual, R.</creatorcontrib><creatorcontrib>Garcia, J.M.</creatorcontrib><creatorcontrib>Acacio, M.E.</creatorcontrib><creatorcontrib>Duato, J.</creatorcontrib><title>Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures</title><title>IEEE transactions on parallel and distributed systems</title><addtitle>TPDS</addtitle><description>It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable chip multiprocessors (CMPs). In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. 
In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using GEMS full system simulator, we compare our proposal against TokenCMP. We show that in absence of failures our proposal does not introduce overhead in terms of increased execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15 percent.</description><subject>and Fault-Tolerance</subject><subject>BShared memory</subject><subject>Chemical-mechanical polishing</subject><subject>Chips</subject><subject>Coherence</subject><subject>Construction</subject><subject>Electromagnetic interference</subject><subject>Electromagnetic radiation</subject><subject>Electromagnetic transients</subject><subject>Electronic components</subject><subject>Energy consumption</subject><subject>Failure</subject><subject>Fault tolerance</subject><subject>Integrated circuits</subject><subject>Messages</subject><subject>Microprocessors</subject><subject>Multi-core/single-chip multiprocessors</subject><subject>Multiprocessor interconnection networks</subject><subject>Networks</subject><subject>Proposals</subject><subject>Protocols</subject><subject>Reliability</subject><subject>System 
recovery</subject><subject>Testing</subject><issn>1045-9219</issn><issn>1558-2183</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2008</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNp90U1v1DAQBuAIgUQpnDlwsTgAl2zHX7F9rEILSFt1pS5ny2tP2JQ0LnbSj3-PwyIOHCpZsi09M6PRW1VvKawoBXOy3Xy-WjEAtVKggT-rjqiUumZU8-flDULWhlHzsnqV8zUAFRLEUYVnDxOOoR9_kGmPZBt_4thebEjrfPm2cY8JR49kk-IUfRxIFxNZx3tyeYdpjy6QczcPUykcMLlF9iNZGpwmv-8n9NOcML-uXnRuyPjm731cfT8_27Zf6_Xll2_t6br2XIupDoEJugPZCL7ThjIDSnSoVdgJ03mFnQMacMcZ47Ic75gUJgQHqAIooPy4-njoe5virxnzZG_67HEY3IhxztYAb4TiwhT54UnJhWh0mV_gpychbRRlnEIjC33_H72OcxrLwrYswzRreFPQyQH5FHNO2Nnb1N-49Ggp2CVIuwRplyDtnyBLxbtDRY-I_7TgWipq-G9mvJds</recordid><startdate>20080801</startdate><enddate>20080801</enddate><creator>Fernandez-Pascual, R.</creator><creator>Garcia, J.M.</creator><creator>Acacio, M.E.</creator><creator>Duato, J.</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>F28</scope><scope>FR3</scope></search><sort><creationdate>20080801</creationdate><title>Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures</title><author>Fernandez-Pascual, R. ; Garcia, J.M. ; Acacio, M.E. 
; Duato, J.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c384t-dd241b05643b89129074fe87db49fc7efa01deb32235235ca2549dda0e7d07013</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2008</creationdate><topic>and Fault-Tolerance</topic><topic>BShared memory</topic><topic>Chemical-mechanical polishing</topic><topic>Chips</topic><topic>Coherence</topic><topic>Construction</topic><topic>Electromagnetic interference</topic><topic>Electromagnetic radiation</topic><topic>Electromagnetic transients</topic><topic>Electronic components</topic><topic>Energy consumption</topic><topic>Failure</topic><topic>Fault tolerance</topic><topic>Integrated circuits</topic><topic>Messages</topic><topic>Microprocessors</topic><topic>Multi-core/single-chip multiprocessors</topic><topic>Multiprocessor interconnection networks</topic><topic>Networks</topic><topic>Proposals</topic><topic>Protocols</topic><topic>Reliability</topic><topic>System recovery</topic><topic>Testing</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Fernandez-Pascual, R.</creatorcontrib><creatorcontrib>Garcia, J.M.</creatorcontrib><creatorcontrib>Acacio, M.E.</creatorcontrib><creatorcontrib>Duato, J.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information 
Systems Abstracts Professional</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><collection>Engineering Research Database</collection><jtitle>IEEE transactions on parallel and distributed systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Fernandez-Pascual, R.</au><au>Garcia, J.M.</au><au>Acacio, M.E.</au><au>Duato, J.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures</atitle><jtitle>IEEE transactions on parallel and distributed systems</jtitle><stitle>TPDS</stitle><date>2008-08-01</date><risdate>2008</risdate><volume>19</volume><issue>8</issue><spage>1044</spage><epage>1056</epage><pages>1044-1056</pages><issn>1045-9219</issn><eissn>1558-2183</eissn><coden>ITDSEO</coden><abstract>It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On the other hand, chip-multiprocessors (CMP) that integrate several processor cores in a single chip are nowadays the best alternative to more efficient use of the increasing number of transistors that can be placed in a single die. Hence, it is necessary to design new techniques to deal with these faults to be able to build sufficiently reliable chip multiprocessors (CMPs). In this work, we present a coherence protocol aimed at dealing with transient failures that affect the interconnection network of a CMP, thus assuming that the network is no longer reliable. In particular, our proposal extends a token-based cache coherence protocol so that no data can be lost and no deadlock can occur due to any dropped message. Using GEMS full system simulator, we compare our proposal against TokenCMP. 
We show that in absence of failures our proposal does not introduce overhead in terms of increased execution time over TokenCMP. Additionally, our protocol can tolerate message loss rates much higher than those likely to be found in the real world without increasing execution time more than 15 percent.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TPDS.2007.70803</doi><tpages>13</tpages></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1045-9219
ispartof IEEE transactions on parallel and distributed systems, 2008-08, Vol.19 (8), p.1044-1056
issn 1045-9219
1558-2183
language eng
recordid cdi_proquest_journals_912282636
source IEEE Electronic Library (IEL)
subjects and Fault-Tolerance
Shared memory
Chemical-mechanical polishing
Chips
Coherence
Construction
Electromagnetic interference
Electromagnetic radiation
Electromagnetic transients
Electronic components
Energy consumption
Failure
Fault tolerance
Integrated circuits
Messages
Microprocessors
Multi-core/single-chip multiprocessors
Multiprocessor interconnection networks
Networks
Proposals
Protocols
Reliability
System recovery
Testing
title Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T02%3A05%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Extending%20the%20TokenCMP%20Cache%20Coherence%20Protocol%20for%20Low%20Overhead%20Fault%20Tolerance%20in%20CMP%20Architectures&rft.jtitle=IEEE%20transactions%20on%20parallel%20and%20distributed%20systems&rft.au=Fernandez-Pascual,%20R.&rft.date=2008-08-01&rft.volume=19&rft.issue=8&rft.spage=1044&rft.epage=1056&rft.pages=1044-1056&rft.issn=1045-9219&rft.eissn=1558-2183&rft.coden=ITDSEO&rft_id=info:doi/10.1109/TPDS.2007.70803&rft_dat=%3Cproquest_RIE%3E1671231065%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=912282636&rft_id=info:pmid/&rft_ieee_id=4385719&rfr_iscdi=true
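The resolver link above carries the citation as OpenURL key/value pairs (`rft.atitle`, `rft.spage`, `rft_id`, and so on). As a rough sketch of how those fields could be read back out, the snippet below parses a shortened copy of the link with Python's standard `urllib.parse`. The helper name `parse_openurl` and the shortened `OPENURL` constant are illustrative assumptions, not part of the catalog record or of any resolver API.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the resolver link from the record (assumption: only a
# subset of the original key/value pairs is kept, for readability).
OPENURL = (
    "https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004"
    "&rft.genre=article"
    "&rft.atitle=Extending%20the%20TokenCMP%20Cache%20Coherence%20Protocol"
    "%20for%20Low%20Overhead%20Fault%20Tolerance%20in%20CMP%20Architectures"
    "&rft.au=Fernandez-Pascual,%20R."
    "&rft.date=2008-08-01&rft.volume=19&rft.issue=8"
    "&rft.spage=1044&rft.epage=1056"
    "&rft.issn=1045-9219&rft.eissn=1558-2183"
    "&rft_id=info:doi/10.1109/TPDS.2007.70803"
)


def parse_openurl(url):
    """Hypothetical helper: collect the rft.* citation fields and the DOI."""
    # parse_qs percent-decodes values and maps each key to a list of values.
    query = parse_qs(urlsplit(url).query)
    # Keep only referent metadata keys ("rft.xxx") and drop the prefix.
    citation = {
        key[len("rft."):]: values[0]
        for key, values in query.items()
        if key.startswith("rft.")
    }
    # Identifiers live under the separate "rft_id" key, one value per scheme.
    for ident in query.get("rft_id", []):
        if ident.startswith("info:doi/"):
            citation["doi"] = ident[len("info:doi/"):]
    return citation


citation = parse_openurl(OPENURL)
```

The resulting dictionary holds plain strings (for example `citation["volume"]` and `citation["doi"]`), so page numbers and dates would still need converting if they are to be used numerically.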