±CIM SRAM for Signed In-Memory Broad-Purpose Computing: From DSP to Neural Processing
Published in: IEEE Journal of Solid-State Circuits, 2021-10, Vol. 56 (10), p. 2981-2992
Main authors:
Format: Article
Language: English
Online access: Order full text
Abstract: This work introduces the ±CIM SRAM macro, which has the unique capability of performing in-memory multiply-and-accumulate computation with signed inputs and signed weights. This enables the execution of a broad set of workloads, ranging from storage, subsequent signal processing, and pre-conditioning or feature extraction to final convolutional neural network (CNN) computations. The ability to handle an arbitrary input/weight sign in any operand, within the same array and the same access cycle, enables true end-to-end data locality, preserving the inherent benefits of in-memory computing along the entire signal chain. The proposed broad-purpose computing SRAM is based on a commercial 8T dual-port bitcell, and its simplicity allows the enhanced periphery to be pitch-matched with the array, making it amenable to automated design via memory compilers. The ±CIM pipelined architecture allows concurrent read/write and compute operations, avoiding the traditional memory unavailability in compute mode, for improved throughput and easier system integration. A 40-nm test chip demonstrating the ±CIM architecture with adjustable input/weight precision exhibits an energy efficiency of up to 41 TOPS/W, at an area (energy) overhead of 38% (25%) and negligible performance overhead compared to a compiled SRAM baseline. The sub-LSB computation mean-squared error associated with mismatch (0.38 LSB) and temporal noise (0.62 LSB) confirms the inherent robustness of the architecture. When used for neural network tasks (LeNet-5 and VGG), the accuracy drop is kept between 0.3% and 3.4% compared to a double-precision software implementation. As an example of a digital signal processing (DSP) workload, a frequency-domain feature extractor for voice activity detection keeps the accuracy drop below 3.8%.
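As a rough behavioral sketch of the operation the abstract describes (not the circuit or the paper's actual implementation), the snippet below models a signed multiply-and-accumulate with adjustable input/weight precision: both operands are quantized to signed integers, then multiplied and accumulated in one step. The function names, the [-1, 1) quantization range, and the 4-bit defaults are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def quantize_signed(x, bits):
    """Uniformly quantize values in [-1, 1) to signed two's-complement integers."""
    scale = 2 ** (bits - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1).astype(np.int32)

def signed_mac(inputs, weights, in_bits=4, w_bits=4):
    """Behavioral model of one in-memory MAC: signed inputs times signed
    weights, accumulated in a single step (here, a plain dot product)."""
    xq = quantize_signed(inputs, in_bits)
    wq = quantize_signed(weights, w_bits)
    return int(np.dot(xq, wq))

# Example: a 16-element signed dot product, as an FIR filter tap sum
# or a CNN layer partial sum would require.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 16)
w = rng.uniform(-1, 1, 16)
print(signed_mac(x, w, in_bits=4, w_bits=4))
```

Supporting signed values in both operands is what lets the same array serve signal-processing kernels with signed coefficients as well as CNN layers, which is the basis of the abstract's broad-purpose claim.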
ISSN: 0018-9200 (print), 1558-173X (electronic)
DOI: 10.1109/JSSC.2021.3092759