Task-Based Algorithm for Matrix Multiplication: A Step Towards Block-Sparse Tensor Computing
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Summary: | Distributed-memory matrix multiplication (MM) is a key element of algorithms
in many domains (machine learning, quantum physics). Conventional algorithms
for dense MM rely on regular/uniform data decomposition to ensure load balance.
These traits conflict with the irregular structure (block-sparse or rank-sparse
within blocks) that is increasingly relevant for fast methods in quantum
physics. To deal with such irregular data, we present a new MM algorithm based
on the Scalable Universal Matrix Multiplication Algorithm (SUMMA). Its novel
features are (1) multiple-issue scheduling of SUMMA iterations and (2) a
fine-grained task-based formulation. The latter eliminates the need for
explicit internodal synchronization; combined with multiple-iteration scheduling,
this lets the algorithm tolerate load imbalance due to nonuniform matrix
structure. For square MM with uniform and nonuniform block sizes (the latter
simulate matrices with general irregular structure) we found excellent
performance in weak- and strong-scaling regimes, on commodity and high-end hardware. |
---|---|
DOI: | 10.48550/arxiv.1504.05046 |
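
The iteration structure of SUMMA mentioned in the summary can be illustrated with a small serial sketch. This is not the authors' implementation: it emulates the block grid in a single process with NumPy, the names `summa_blocked` and `split` are hypothetical, and a zero block is represented by `None` as an assumed convention for block sparsity. It does not model multiple-issue scheduling or internodal communication; it only shows that each block product in iteration `k` is an independent task and that nonuniform block sizes give tasks of unequal cost.

```python
import numpy as np

def summa_blocked(A_blocks, B_blocks, row_sizes, inner_sizes, col_sizes):
    """Serial emulation of SUMMA on a logical grid of blocks.

    A_blocks[i][k] and B_blocks[k][j] are NumPy arrays, or None for a
    zero block (assumed convention). Block sizes may be nonuniform.
    """
    C_blocks = [[np.zeros((m, n)) for n in col_sizes] for m in row_sizes]
    for k in range(len(inner_sizes)):        # one SUMMA iteration per panel k
        for i in range(len(row_sizes)):
            a = A_blocks[i][k]               # in SUMMA: received via a row broadcast
            if a is None:
                continue                     # zero block: no task is created
            for j in range(len(col_sizes)):
                b = B_blocks[k][j]           # in SUMMA: received via a column broadcast
                if b is None:
                    continue
                C_blocks[i][j] += a @ b      # one fine-grained task
    return C_blocks

def split(M, row_sizes, col_sizes):
    """Cut a dense matrix into a grid of blocks with the given sizes."""
    r = np.cumsum([0] + list(row_sizes))
    c = np.cumsum([0] + list(col_sizes))
    return [[M[r[i]:r[i + 1], c[j]:c[j + 1]] for j in range(len(col_sizes))]
            for i in range(len(row_sizes))]

rng = np.random.default_rng(0)
sizes = [2, 5, 3]                            # nonuniform block sizes, total 10
A = rng.standard_normal((10, 10))
B = rng.standard_normal((10, 10))
A[0:2, 2:7] = 0.0                            # make block (0, 1) of A zero
A_blocks = split(A, sizes, sizes)
B_blocks = split(B, sizes, sizes)
A_blocks[0][1] = None                        # mark it as absent

C = np.block(summa_blocked(A_blocks, B_blocks, sizes, sizes, sizes))
print(np.allclose(C, A @ B))
```

In a distributed setting the two inner loops would run on different nodes, and the `+=` updates for successive `k` are what a task-based runtime could overlap instead of separating with a barrier.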