Memory Optimizations for Sparse Linear Algebra on GPU Hardware
Main Authors: , , ,
Format: Conference Paper
Language: English
Subjects:
Online Access: Order full text
Abstract: An effort to maximize memory bandwidth utilization for a sparse linear algebra kernel executing on NVIDIA® Tesla V100 and A100 Graphics Processing Units (GPUs) is described. The kernel consists of a block-sparse matrix-vector product and a series of forward/backward triangular solves. The computation is memory-bound and exhibits low arithmetic intensity. Along with a relatively small block size, the data layout makes it challenging to effectively utilize the available memory bandwidth on common GPU architectures. An earlier implementation, which used a warp to process a single row of the matrix, was found to yield good memory performance on the V100 architecture. However, a new approach, which assigns a warp to six rows of the matrix, is proposed for the A100. In addition, two new features offered by the A100 architecture are explored: L2 residency control enables a portion of the L2 cache to be used for persistent data access, and the asynchronous copy instruction allows data to be loaded directly from main memory into shared memory. Demonstrations show that the new implementation improves memory bandwidth utilization from 71.5% to 81.2% of the peak available on the A100 architecture.
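
The L2 residency control mentioned in the abstract is exposed in CUDA 11 as an access-policy window attached to a stream. The following is a minimal host-side sketch of that mechanism, not the authors' implementation; the function name, the vector `x` (a plausible candidate for persistence, since it is reread across the matvec and triangular-solve sweeps), and the sizes are illustrative assumptions.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Hedged sketch: mark a frequently reused vector `x` as persisting in L2.
// Names, the choice of `x`, and sizes are assumptions, not from the paper.
void enable_l2_residency(cudaStream_t stream, const double* x, size_t bytes)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Set aside the maximum allowed slice of L2 for persisting accesses.
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize,
                       prop.persistingL2CacheMaxSize);

    // Describe an access-policy window over the vector: accesses inside the
    // window are cached as persisting; other accesses stream through L2.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = const_cast<double*>(x);
    attr.accessPolicyWindow.num_bytes =
        std::min(bytes, static_cast<size_t>(prop.accessPolicyMaxWindowSize));
    attr.accessPolicyWindow.hitRatio  = 1.0f;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```

Any kernel subsequently launched on `stream` will then see accesses to the windowed region treated as persisting, subject to the device's window-size limit.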
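The asynchronous copy instruction is likewise available through the cooperative groups `memcpy_async` API, which on the A100 lowers to `cp.async` and moves data from global to shared memory without staging it in registers. Below is a hedged device-side sketch of such a tile load; the kernel name, tile size, and data layout are assumptions for illustration and do not reflect the paper's block-sparse format.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Hypothetical tile load: each block stages TILE matrix values into shared
// memory with cp.async before computing. Layout and size are illustrative.
constexpr int TILE = 256;

__global__ void matvec_tile(const double* __restrict__ vals,
                            const double* __restrict__ x,
                            double* __restrict__ y)
{
    __shared__ double tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    // Asynchronously copy one tile of matrix values; on the A100 this
    // bypasses the register file on the way from global to shared memory.
    cg::memcpy_async(block, tile,
                     vals + static_cast<size_t>(blockIdx.x) * TILE,
                     sizeof(double) * TILE);
    cg::wait(block);  // Block until the staged tile is visible.

    // ... per-warp row processing would consume `tile`, x, and y here ...
}
```

Both features require CUDA 11 or later, and the copy is hardware-accelerated only when compiled for compute capability 8.0 (`-arch=sm_80`); on earlier architectures it falls back to a synchronous path.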