Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
Saved in:
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: With unprecedented demand for generative AI (GenAI) inference, acceleration of the primitives that dominate GenAI, such as general matrix-vector multiplication (GEMV), is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While the proposed PIM designs stand to accelerate GEMV, we observe in this work that a key impediment to truly harnessing PIM acceleration is deducing the optimal placement of the matrix in memory banks. To this end, we tease out several factors that impact data placement and propose the PIMnast methodology which, like a gymnast, balances these factors to identify data placements that deliver GEMV acceleration. Across a spectrum of GenAI models, our proposed PIMnast methodology, along with additional orchestration knobs we identify, delivers up to a 6.86$\times$ speedup for GEMVs (of the available 7$\times$ roofline speedup), leading to up to a 5$\times$ speedup for per-token latencies.
DOI: 10.48550/arxiv.2403.20297
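
To make the data-placement problem concrete, below is a minimal Python sketch, not the paper's implementation, of the kind of balance PIMnast seeks: matrix rows are spread round-robin across banks so every bank holds an equal share of the GEMV work and a single broadcast command keeps all banks busy. The bank count, the round-robin policy, and the per-bank dot-product loop are illustrative assumptions; the paper balances several additional placement factors that this sketch omits.

```python
import numpy as np

NUM_BANKS = 16  # assumed bank count; real PIM parts differ


def place_rows_round_robin(matrix):
    """Assign matrix rows to banks round-robin so each bank owns an
    (almost) equal share of the GEMV work -- a simple stand-in for
    the balanced placements PIMnast searches for."""
    banks = [[] for _ in range(NUM_BANKS)]
    for row_idx in range(matrix.shape[0]):
        banks[row_idx % NUM_BANKS].append(row_idx)
    return banks


def pim_gemv(matrix, vector, placement):
    """Emulate broadcast-style PIM execution: every 'bank' runs the
    same dot-product command over the rows it owns."""
    result = np.zeros(matrix.shape[0])
    for bank_rows in placement:
        for r in bank_rows:  # the same command, executed in each bank
            result[r] = matrix[r] @ vector
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((128, 64))
    x = rng.standard_normal(64)

    placement = place_rows_round_robin(A)

    # Per-bank row counts are balanced (differ by at most one row).
    counts = [len(rows) for rows in placement]
    assert max(counts) - min(counts) <= 1

    # The placed computation matches the reference GEMV.
    np.testing.assert_allclose(pim_gemv(A, x, placement), A @ x)
```

Running the script checks the two properties the sketch is meant to show: the per-bank row counts stay within one of each other, and the banked computation reproduces the reference `A @ x`. A real placement must preserve this balance while also honoring memory-system constraints, which is what makes the problem nontrivial.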