Power-Intent Systolic Array Using Modified Parallel Multiplier for Machine Learning Acceleration


Bibliographic details
Published in: Sensors (Basel, Switzerland), 2023-04, Vol. 23 (9), p. 4297
Main authors: Inayat, Kashif, Muslim, Fahad Bin, Iqbal, Javed, Hassnain Mohsan, Syed Agha, Alkahtani, Hend Khalid, Mostafa, Samih M
Format: Article
Language: English
Online access: Full text
Description
Abstract: Systolic arrays are an integral part of many modern machine learning (ML) accelerators because they perform matrix multiplication, a key primitive in modern ML models, efficiently. State-of-the-art systolic array-based accelerators mainly target area and delay optimizations, with power optimization treated as a secondary target. Very few accelerator designs target power directly, and those that do rely on very complex algorithmic modifications that in turn compromise area or delay. We present a novel Power-Intent Systolic Array (PI-SA) based on fine-grained power gating of the multiplier in the multiplication and accumulation (MAC) block inside each processing element of the systolic array. This reduces the design's power consumption significantly, but at an additional delay cost. To offset the delay cost, we introduce a modified decomposition multiplier that yields a smaller reduction tree; to further improve area and delay, we also replace the carry propagation adder with a carry save adder inside each sub-multiplier. Compared with the baseline Gemmini naive systolic array design and its variant, i.e., a conventional systolic array design, the proposed design exhibits a delay reduction of up to 6%, an area improvement of up to 32%, and a power reduction of up to 57% for varying accumulator bit-widths.
ISSN:1424-8220
DOI:10.3390/s23094297
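
Illustrative note: the abstract mentions a decomposition multiplier built from sub-multipliers and the use of carry save addition in place of carry propagation. The following is a minimal software sketch of those two general ideas, not the circuit from the article. The function names, the 8-bit operand width, and the choice to split each operand into two halves are assumptions made only for illustration.

```python
def carry_save_add(a, b, c):
    """3:2 compression on whole words: returns (s, carry) with a + b + c == s + carry.

    No carry ripples between bit positions here; each bit is computed independently.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def decomposed_multiply(x, y, width=8):
    """Multiply two unsigned `width`-bit integers via four half-width sub-products.

    Hypothetical helper for illustration; `width` is assumed to be even.
    """
    half = width // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half

    # Four half-width sub-products, shifted to their weights.
    ll = x_lo * y_lo
    lh = (x_lo * y_hi) << half
    hl = (x_hi * y_lo) << half
    hh = (x_hi * y_hi) << (2 * half)

    # Reduce four partial results to two with carry save additions,
    # then perform a single carry-propagating addition at the end.
    s, c = carry_save_add(ll, lh, hl)
    s, c = carry_save_add(s, c, hh)
    return s + c
```

In this sketch the only carry-propagating addition is the final `s + c`; in hardware, deferring carry propagation in this way is the usual motivation for carry save adders in multiplier reduction trees.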