Compiling for the IBM Matrix Engine for Enterprise Workloads


Bibliographic details
Published in:	IEEE Micro, 2022-09, Vol. 42 (5), p. 34-40
Main authors:	de Carvalho, Joao P. L., Moreira, Jose E., Amaral, Jose Nelson
Format:	Article
Language:	English
Subjects:	Codes; Companies; Instruction sets; Layout; Libraries; Linear algebra; Machine learning; Mathematical analysis; Matrices (mathematics); Matrix algebra; Microprocessors; Multiplication; Optimization; Product design; Program processors; Reduced instruction set computing; Registers; System-on-chip

Description
Summary:	The matrix-multiply assist (MMA) facility is the latest addition to IBM’s Power instruction set architecture and first shipped in the recently introduced POWER10 processor. MMA is designed to accelerate matrix–matrix operations, such as matrix multiplication and convolution, using instructions that compute the outer product of vector-register operands. Outer product computations have been used for decades in linear algebra libraries to deliver high-performance implementations of matrix operations. Such libraries use conventional single-instruction–multiple-data (SIMD) instructions to emulate outer product operations. MMA in POWER10 is the first hardware released to market with direct support for outer product operations. MMA supports a wider range of data types than any other accelerator design announced to date. Unleashing the high performance enabled by MMA requires careful code generation. Two key considerations for optimal MMA code performance are 1) the choice of accumulation layout when maximizing the use of the accumulators and 2) the selection of matrix access order. This article shows that over 92% of peak performance in POWER10 with MMA can be achieved when the code generation makes the right choices.
ISSN:	0272-1732 1937-4143
DOI:	10.1109/MM.2022.3176529
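
Note:	The summary describes matrix multiplication built from outer products of vector operands that are summed into an accumulator. A minimal sketch of that formulation in plain Python follows. It does not use MMA instructions or model the accumulation layouts and access orders studied in the article; the function names `outer_product_accumulate` and `matmul_outer` are hypothetical and chosen only for illustration.

```python
def outer_product_accumulate(acc, x, y):
    """Rank-1 update in place: acc[i][j] += x[i] * y[j]."""
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            acc[i][j] += xi * yj


def matmul_outer(a, b):
    """Compute C = A * B as a sum of k outer products.

    a is an m x k matrix and b is a k x n matrix, both given as lists of rows.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0.0] * n for _ in range(m)]  # the accumulator tile
    for p in range(k):
        col_a = [a[i][p] for i in range(m)]  # p-th column of A
        row_b = b[p]                         # p-th row of B
        outer_product_accumulate(acc, col_a, row_b)
    return acc


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul_outer(a, b)
print(c)
```

Each pass of the loop over `p` adds one outer product to the accumulator, so every element of the column and row operands is reused across a whole row or column of the result tile.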