An Efficient Two-Stage Pipelined Compute-in-Memory Macro for Accelerating Transformer Feed-Forward Networks
Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to its dynamic workloads, intensive computations, and substantial memory access. In this article, we propose...
Gespeichert in:
Veröffentlicht in: | IEEE transactions on very large scale integration (VLSI) systems 2024-10, Vol.32 (10), p.1889-1899 |
---|---|
Hauptverfasser: | , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Transformer architectures have achieved state-of-the-art performance in various applications. However, deploying transformer models on resource-constrained platforms is still challenging due to its dynamic workloads, intensive computations, and substantial memory access. In this article, we propose a two-stage pipelined compute-in-memory (CIM) macro for effectively deploying and accelerating the feed-forward network (FFN) layers of transformer models. Two independent CIM arrays are designed to execute the two distinct linear projections in FFN layers, which are interconnected by co-designed analog rectified linear unit (ReLU) circuits to realize the nonlinear activation function. The analog multiply-and-add (MAC) results from the first CIM array are streamed directly to the analog ReLU circuits, and subsequently to the next CIM array for performing another linear projection. This architecture eliminates the need for analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) for internal results' staging, thereby enhancing overall macro efficiency and reducing computing latency. A proof-of-concept macro is fabricated using TSMC 65-nm process and achieves 4.096 TOPS peak throughput, 4.39 TOPS/mm2 area efficiency, and 49.83 TOPS/W energy efficiency. To map transformer models onto the proposed macro, we quantize the FFN layers of BERTMINI model under per-token granularity for activations and per-tensor granularity for weights using quantization-aware training (QAT), which exhibits excellent accuracy across multiple benchmarks. |
---|---|
ISSN: | 1063-8210 1557-9999 |
DOI: | 10.1109/TVLSI.2024.3432403 |