±CIM SRAM for Signed In-Memory Broad-Purpose Computing: From DSP to Neural Processing
Published in: IEEE Journal of Solid-State Circuits, 2021-10, Vol. 56 (10), p. 2981-2992
Main authors:
Format: Article
Language: English
Online access: Order full text
Abstract: This work introduces the ±CIM SRAM macro, which has the unique capability of performing in-memory multiply-and-accumulate computation with signed inputs and signed weights. This enables the execution of a broad set of workloads, ranging from storage, subsequent signal processing, and pre-conditioning or feature extraction to final convolutional neural network (CNN) computations. The ability to handle an arbitrary input/weight sign in any operand, within the same array and the same access cycle, enables true end-to-end data locality, preserving the inherent benefits of in-memory computing along the entire signal chain. The proposed broad-purpose computing SRAM is based on a commercial 8T dual-port bitcell, and its simplicity allows the enhanced periphery to be pitch-matched with the array, making it amenable to automated design via memory compilers. The ±CIM pipelined architecture allows concurrent read/write and compute operations, avoiding the traditional memory unavailability in compute mode, for improved throughput and easier system integration. A 40-nm test chip demonstrating the ±CIM architecture with adjustable input/weight precision exhibits an energy efficiency of up to 41 TOPS/W, at an area (energy) overhead of 38% (25%) and negligible performance overhead compared to a compiled SRAM baseline. The sub-LSB computation mean-squared error associated with mismatch (0.38 LSB) and temporal noise (0.62 LSB) confirms the inherent robustness of the architecture. When used for neural network tasks (LeNet-5 and VGG), the accuracy drop is kept between 0.3% and 3.4% compared to a double-precision software implementation. As an example of a digital signal processing (DSP) workload, a frequency-domain feature extractor for voice activity detection keeps the accuracy drop below 3.8%.
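As a rough behavioral sketch of the operation the abstract describes (not the circuit or the paper's actual implementation), the snippet below models a signed multiply-and-accumulate with adjustable input/weight precision: both operands are quantized to signed integers, then multiplied and accumulated in one step. The function names, the [-1, 1) quantization range, and the 4-bit defaults are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def quantize_signed(x, bits):
    """Uniformly quantize values in [-1, 1) to signed two's-complement integers."""
    scale = 2 ** (bits - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1).astype(np.int32)

def signed_mac(inputs, weights, in_bits=4, w_bits=4):
    """Behavioral model of one in-memory MAC: signed inputs times signed
    weights, accumulated in a single step (here, a plain dot product)."""
    xq = quantize_signed(inputs, in_bits)
    wq = quantize_signed(weights, w_bits)
    return int(np.dot(xq, wq))

# Example: a 16-element signed dot product, as an FIR filter tap sum
# or a CNN layer partial sum would require.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 16)
w = rng.uniform(-1, 1, 16)
print(signed_mac(x, w, in_bits=4, w_bits=4))
```

Supporting signed values in both operands is what lets the same array serve signal-processing kernels with signed coefficients as well as CNN layers, which is the basis of the abstract's broad-purpose claim.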
ISSN: 0018-9200 (print), 1558-173X (electronic)
DOI: 10.1109/JSSC.2021.3092759