Compiling for the IBM Matrix Engine for Enterprise Workloads

Bibliographic details
Published in: IEEE Micro 2022-09, Vol. 42 (5), p. 34-40
Main authors: de Carvalho, Joao P. L.; Moreira, Jose E.; Amaral, Jose Nelson
Format: Article
Language: English
Description
Abstract: The matrix-multiply assist (MMA) facility is the latest addition to IBM’s Power instruction set architecture and first shipped in the recently introduced POWER10 processor. MMA is designed to accelerate matrix–matrix operations, such as matrix multiplication and convolution, using instructions that compute the outer product of vector-register operands. Outer product computations have been used for decades in linear algebra libraries to deliver high-performance implementations of matrix operations. Such libraries use conventional single-instruction–multiple-data (SIMD) instructions to emulate outer product operations. MMA in POWER10 is the first commercially released hardware with direct support for outer product operations. MMA supports a wider diversity of data types than any accelerator design currently announced. Unleashing the high performance enabled by MMA requires careful code generation. Two key considerations for optimal MMA code performance are 1) the choice of accumulation layout when maximizing the use of the accumulators and 2) the selection of matrix access order. This article shows that over 92% of peak performance in POWER10 with MMA can be achieved when the code generation makes the right choices.
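Note: the outer-product formulation mentioned in the abstract can be illustrated with a minimal sketch. The code below is not taken from the article and does not model MMA instructions, accumulator sizes, or data types. It only shows, in plain Python, how a matrix product can be built by accumulating one outer product (rank-1 update) per step along the shared dimension. The function names `outer_product_accumulate` and `matmul_by_outer_products` are hypothetical and chosen for illustration.

```python
def outer_product_accumulate(acc, x, y):
    """Rank-1 update in place: acc[i][j] += x[i] * y[j]."""
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            acc[i][j] += xi * yj


def matmul_by_outer_products(a, b):
    """Compute A x B as a sum of outer products, one per index k.

    `a` is an m-by-k list of lists and `b` is a k-by-n list of lists.
    The `acc` tile stands in (by assumption) for a hardware accumulator.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    for k in range(k_dim):
        col_a = [a[i][k] for i in range(m)]  # k-th column of A
        row_b = b[k]                         # k-th row of B
        outer_product_accumulate(acc, col_a, row_b)
    return acc


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_by_outer_products(a, b))
```

In this sketch each step reads one column of A and one row of B, so the order in which the matrices are traversed determines which elements are loaded together. That is the kind of choice the abstract refers to as matrix access order.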
ISSN: 0272-1732, 1937-4143
DOI: 10.1109/MM.2022.3176529