GPU Tensor Cores for fast Arithmetic Reductions
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running time of the proposed chained tensor core approach is $T(n) = 5\log_{m^2} n$ and its speedup is $S = \dfrac{4}{5}\log_{2} m^2$ over the classic $O(n \log n)$ parallel reduction algorithm. Experimental performance results show that the proposed reduction method is $\sim 3.2\times$ faster than a conventional GPU reduction implementation and preserves numerical precision, because the sub-results of each chain of $R$ MMAs are kept as 32-bit floating point values before all of them are reduced into a final 32-bit result. The chained MMA design allows a flexible configuration of thread blocks; small thread blocks of 32 or 128 threads can still achieve maximum performance using a chain of $R = 4$ or $R = 5$ MMAs per block, while large thread blocks work best with $R = 1$. The results obtained in this work show that tensor cores can indeed provide a significant performance improvement to non-Machine-Learning applications such as the arithmetic reduction, which is an integration tool used in the study of many scientific phenomena.
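For context, the stated speedup is consistent with a comparison of parallel step counts: a binary-tree parallel reduction of $n$ numbers takes on the order of $\log_2 n$ steps, and assuming (as the speedup formula implies, though it is not spelled out in the abstract) that each such step costs about $4$ time units in the same model that charges $5$ units per chained-MMA level,

$$S = \frac{4 \log_2 n}{5 \log_{m^2} n} = \frac{4}{5}\,\frac{\log_2 n}{\log_2 n / \log_2 m^2} = \frac{4}{5}\log_2 m^2,$$

which for the $16 \times 16$ MMA shape exposed by current tensor cores ($m = 16$, $m^2 = 256$) gives $S = \frac{4}{5}\cdot 8 = 6.4$.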
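The core encoding can be sketched in a few lines of CUDA: multiplying an all-ones matrix against a data tile makes the tensor core produce column sums, which a final small addition collapses into the total. The sketch below is illustrative only (kernel and variable names are assumptions, not the authors' code) and handles a single $16 \times 16$ tile with one MMA; the paper chains $R$ such MMAs per block and carries the partial results in 32-bit accumulators.

```cuda
// Minimal sketch (illustrative, not the authors' implementation): reduce one
// 16x16 tile of half-precision values with a single tensor-core MMA by
// multiplying an all-ones matrix with the data tile. Every row of the FP32
// accumulator then holds the 16 column sums; lane 0 adds them up.
// Requires compute capability >= 7.0 (e.g. nvcc -arch=sm_70).
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

constexpr int M = 16;  // WMMA tile shape: 16x16x16

__global__ void tileReduce(const half *data, float *out) {
    // One warp cooperatively performs the 16x16x16 MMA.
    wmma::fragment<wmma::matrix_a, M, M, M, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, M, M, M, half, wmma::row_major> tile;
    wmma::fragment<wmma::accumulator, M, M, M, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f));  // all-ones left operand
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(tile, data, M);          // 16x16 data tile, row-major

    // acc = ones * tile + acc: entry (i, j) becomes the sum of column j,
    // so any single row of acc contains all 16 column sums.
    wmma::mma_sync(acc, ones, tile, acc);

    __shared__ float partial[M * M];
    wmma::store_matrix_sync(partial, acc, M, wmma::mem_row_major);
    __syncwarp();

    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int j = 0; j < M; ++j) sum += partial[j];  // row 0 = the 16 column sums
        *out = sum;
    }
}

int main() {
    half h_data[M * M];
    for (int i = 0; i < M * M; ++i) h_data[i] = __float2half(1.0f);  // expected sum: 256

    half *d_data;
    float *d_out;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    tileReduce<<<1, 32>>>(d_data, d_out);  // one warp drives the MMA

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.1f\n", h_out);

    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}
```

Built with `nvcc -arch=sm_70` (or newer), the example prints 256 for a tile of ones; a full reduction would launch many such warps and feed each block's 32-bit partial sums into further MMA levels, as the abstract describes.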
DOI: 10.48550/arxiv.2001.05585