A poly-algorithm for parallel dense matrix multiplication on two-dimensional process grid topologies


Bibliographic details
Published in: Concurrency (Chichester, England), 1997-05, Vol.9 (5), p.345-389
Main authors: Li, J., Skjellum, A., Falgout, R. D.
Format: Article
Language: English
Online access: Full text
Description
Abstract: In this paper, we present several new and generalized parallel dense matrix multiplication algorithms of the form C = αAB + βC on two‐dimensional process grid topologies. These algorithms can deal with rectangular matrices distributed on rectangular grids. We classify these algorithms coherently into three categories according to the communication primitives used and thus we offer a taxonomy for this family of related algorithms. All these algorithms are represented in the data distribution independent approach and thus do not require a specific data distribution for correctness. The algorithmic compatibility condition result shown here ensures the correctness of the matrix multiplication. We define and extend the data distribution functions and introduce permutation compatibility and algorithmic compatibility. We also discuss a permutation compatible data distribution (modified virtual 2D data distribution). We conclude that no single algorithm always achieves the best performance on different matrix and grid shapes. A practical approach to resolve this dilemma is to use poly‐algorithms. We analyze the characteristics of each of these matrix multiplication algorithms and provide initial heuristics for using the poly‐algorithm. All these matrix multiplication algorithms have been tested on the IBM SP2 system. The experimental results are presented in order to demonstrate their relative performance characteristics, motivating the combined value of the taxonomy and new algorithms introduced here. © 1997 by John Wiley & Sons, Ltd.
ISSN: 1040-3108, 1096-9128
DOI: 10.1002/(SICI)1096-9128(199705)9:5<345::AID-CPE258>3.0.CO;2-7