Can GPU performance increase faster than the code error rate?

Description
Graphics processing units (GPUs) are the reference architecture for accelerating high-performance computing applications and the training and inference of convolutional neural networks. For both domains, performance and reliability are two of the main constraints. It is commonly believed that the only way to increase reliability is to sacrifice performance, e.g., by adding redundancy. We show in this paper that this is not always the case. As a very promising result, we found that most GPU performance improvements also increase the number of executions completed correctly before a silent data corruption (SDC) is experienced. We consider four common GPU performance optimizations: architectural solutions, software implementations, compiler optimizations, and the degree of thread parallelism. We compare different implementations of a variety of parallel codes and, through beam experiments and application profiling, show that a performance improvement typically (but not necessarily) increases the GPU SDC rate. Nevertheless, for the vast majority of configurations the performance gain is much higher than the SDC rate increase, allowing a larger amount of correct data to be processed. As we show, programmer choices can increase the number of correctly completed executions by up to 25× without redesigning the algorithm or adding specific hardening solutions.
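The trade-off the abstract describes can be summarized with a back-of-the-envelope relation: if an optimization speeds a code up by a factor S but multiplies the SDC rate by a factor R, the expected number of executions completed correctly between SDCs still grows whenever S/R > 1. The Python sketch below is not taken from the paper; the function name and the example numbers are illustrative assumptions, not measured values.

    # Back-of-the-envelope sketch of the trade-off discussed in the abstract.
    # Assumption: SDCs arrive at a constant rate, so the expected number of
    # executions finished correctly before one SDC is (1 / sdc_rate) / exec_time.
    # The names and numbers below are illustrative, not values from the paper.

    def correct_executions_between_sdcs(exec_time_s: float, sdc_rate_per_s: float) -> float:
        """Expected executions completed correctly before one SDC occurs."""
        mean_time_between_sdcs = 1.0 / sdc_rate_per_s
        return mean_time_between_sdcs / exec_time_s

    # Hypothetical baseline: 1 s per execution, SDC rate of 1e-6 per second.
    baseline = correct_executions_between_sdcs(exec_time_s=1.0, sdc_rate_per_s=1e-6)

    # Hypothetical optimized version: 4x faster but with a 1.5x higher SDC rate.
    optimized = correct_executions_between_sdcs(exec_time_s=0.25, sdc_rate_per_s=1.5e-6)

    print(f"gain in correctly completed executions: {optimized / baseline:.2f}x")  # ~2.67x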

Bibliographic Details
Published in: The Journal of supercomputing, 2024-08, Vol. 80 (12), p. 16918-16946
Main authors: dos Santos, Fernando Fernandes; Rech, Paolo
Format: Article
Language: English
Subjects: Algorithms; Artificial neural networks; Compilers; Computer Science; Configuration management; Graphics processing units; Hardware Architecture; Interpreters; Processor Architectures; Programming Languages; Reliability
Online access: Full text
Publisher: Springer US, New York
ISSN: 0920-8542
EISSN: 1573-0484
DOI: 10.1007/s11227-024-06119-4