GPU Tensor Cores for fast Arithmetic Reductions
This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2}{n}$ and its speedup is $S = \frac{4}{5}\log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation, and it preserves numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating-point values before being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4$ or $R = 5$ MMAs per block, while large thread-blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-machine-learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena.
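To make the encoding concrete, the sketch below shows how a single warp can sum a 16 × 16 tile with one tensor-core MMA by loading the data as the B operand and an all-ones matrix as the A operand, so that the accumulator ends up holding the tile's column sums; the paper's design chains $R$ such MMAs into the same FP32 accumulator before collapsing it. This is a minimal sketch under assumed parameters (WMMA 16×16×16 fragments, one warp per block, $R = 1$, and a plain scalar collapse at the end), not the authors' kernel; the kernel name `tile_sum_wmma` and the launch layout are hypothetical.

```cuda
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Illustrative sketch, not the authors' implementation: one warp per block sums
// a 16x16 tile of the input with a tensor-core MMA. With A = all-ones,
// acc = A * V + acc leaves the column sums of V replicated in every row of acc.
// Launch with 32 threads (one warp) per block.
__global__ void tile_sum_wmma(const half *data, float *block_sums, int n) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> v_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(a_ones, __float2half(1.0f));  // A = ones
    wmma::fill_fragment(acc, 0.0f);                    // C = 0

    const int tile = blockIdx.x * 256;                 // 16*16 values per block
    if (tile + 256 > n) return;                        // ragged tail handled elsewhere

    // One MMA: every row of acc now holds the 16 column sums of the tile.
    wmma::load_matrix_sync(v_frag, data + tile, 16);
    wmma::mma_sync(acc, a_ones, v_frag, acc);

    // Collapse the 16 column sums with plain FP32 adds (the paper chains more
    // MMAs into acc before this step; R = 1 here for brevity).
    __shared__ float out[256];
    wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
    __syncthreads();                                   // make the staged tile visible to lane 0
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += out[j];      // row 0 of acc = column sums
        block_sums[blockIdx.x] = s;
    }
}
```

A launch such as `tile_sum_wmma<<<n / 256, 32>>>(d_data, d_partials, n)` would leave one FP32 partial sum per 256-element tile, which any conventional reduction can then finish; keeping these chain sub-results in 32-bit floating point is what the abstract's precision claim refers to.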
Main authors: | Navarro, Cristóbal A; Carrasco, Roberto; Barrientos, Ricardo J; Riquelme, Javier A; Vega, Raimundo |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Distributed, Parallel, and Cluster Computing |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Navarro, Cristóbal A; Carrasco, Roberto; Barrientos, Ricardo J; Riquelme, Javier A; Vega, Raimundo |
description | This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply-accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2}{n}$ and its speedup is $S = \frac{4}{5}\log_{2}{m^2}$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation, and it preserves numerical precision because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating-point values before being reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread-blocks; small thread-blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4$ or $R = 5$ MMAs per block, while large thread-blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-machine-learning applications such as the arithmetic reduction, which is an integration tool for studying many scientific phenomena. |
doi_str_mv | 10.48550/arxiv.2001.05585 |
format | Article |
creationdate | 2020-01-15 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
linktorsrc | https://arxiv.org/abs/2001.05585 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2001.05585 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2001_05585 |
source | arXiv.org |
subjects | Computer Science - Distributed, Parallel, and Cluster Computing |
title | GPU Tensor Cores for fast Arithmetic Reductions |
url | https://arxiv.org/abs/2001.05585 |