GPU Tensor Cores for fast Arithmetic Reductions
This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2}{n}$ and its speedup is $S = \frac{4}{5}\log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation, and it preserves numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating-point values before being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4$ or $R = 5$ MMAs per block, while large thread-blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-machine-learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
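To make the encoding concrete, the sketch below shows how a single warp can sum a 16 × 16 tile with one tensor-core MMA by loading the data as the B operand and an all-ones matrix as the A operand, so that the accumulator ends up holding the tile's column sums; the paper's design chains $R$ such MMAs into the same FP32 accumulator before collapsing it. This is a minimal sketch under assumed parameters (WMMA 16×16×16 fragments, one warp per block, $R = 1$, and a plain scalar collapse at the end), not the authors' kernel; the kernel name `tile_sum_wmma` and the launch layout are hypothetical.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Illustrative sketch, not the authors' implementation: one warp per block sums
// a 16x16 tile of the input with a tensor-core MMA. With A = all-ones,
// acc = A * V + acc leaves the column sums of V replicated in every row of acc.
// Launch with 32 threads (one warp) per block.
__global__ void tile_sum_wmma(const half *data, float *block_sums, int n) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> v_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(a_ones, __float2half(1.0f));  // A = ones
    wmma::fill_fragment(acc, 0.0f);                    // C = 0

    const int tile = blockIdx.x * 256;                 // 16*16 values per block
    if (tile + 256 > n) return;                        // ragged tail handled elsewhere

    // One MMA: every row of acc now holds the 16 column sums of the tile.
    wmma::load_matrix_sync(v_frag, data + tile, 16);
    wmma::mma_sync(acc, a_ones, v_frag, acc);

    // Collapse the 16 column sums with plain FP32 adds (the paper chains more
    // MMAs into acc before this step; R = 1 here for brevity).
    __shared__ float out[256];
    wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
    __syncthreads();                                   // make the staged tile visible to lane 0
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += out[j];      // row 0 of acc = column sums
        block_sums[blockIdx.x] = s;
    }
}
```

A launch such as `tile_sum_wmma<<<n / 256, 32>>>(d_data, d_partials, n)` would leave one FP32 partial sum per 256-element tile, which any conventional reduction can then finish; keeping these chain sub-results in 32-bit floating point is what the abstract's precision claim refers to.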
Main authors: | Navarro, Cristóbal A; Carrasco, Roberto; Barrientos, Ricardo J; Riquelme, Javier A; Vega, Raimundo |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Distributed, Parallel, and Cluster Computing |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Navarro, Cristóbal A; Carrasco, Roberto; Barrientos, Ricardo J; Riquelme, Javier A; Vega, Raimundo |
description | This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2}{n}$ and its speedup is $S = \frac{4}{5}\log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation, and it preserves numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating-point values before being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4$ or $R = 5$ MMAs per block, while large thread-blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-machine-learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena. |
doi_str_mv | 10.48550/arxiv.2001.05585 |
format | Article |
creationdate | 2020-01-15 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
linktorsrc | https://arxiv.org/abs/2001.05585 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2001.05585 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2001_05585 |
source | arXiv.org |
subjects | Computer Science - Distributed, Parallel, and Cluster Computing |
title | GPU Tensor Cores for fast Arithmetic Reductions |
url | https://arxiv.org/abs/2001.05585 |