A Reconfigurable 16Kb AND8T SRAM Macro With Improved Linearity for Multibit Compute-In Memory of Artificial Intelligence Edge Devices

Detailed Description

Bibliographic Details
Published in: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2022-06, Vol. 12 (2), pp. 522-535
Main Authors: Sharma, Vishal; Kim, Ju-Eon; Kim, Hyunjoon; Lu, Lu; Kim, Tony Tae-Hyoung
Format: Article
Language: English
Description
Summary: Compute-in-Memory (CIM) is a promising candidate for performing the energy-efficient multiply-and-accumulate (MAC) operations of modern Artificial Intelligence (AI) edge devices. This work proposes a multi-bit-precision (4b input, 4b weight, and 4b output) 128 × 128 SRAM CIM architecture. The 4b input is implemented using a voltage-scaling and charge-sharing-based scheme. To achieve efficient computation with improved linearity, a novel AND-logic-based 8T SRAM cell (AND8T) is proposed. To address the non-idealities of analog voltage- or current-based operation, the proposed AND8T employs charge-domain computation by overlaying a metal-oxide-metal capacitor (MOM cap) with no area overhead. The proposed AND8T mitigates the linearity issue of MAC operations, which is highly desirable for the reliable operation of convolutional neural networks (CNNs). The proposed 16Kb macro asserts 128 inputs in parallel and processes a 128-element 4b dot product in a single cycle per array column (a single neuron). The macro can also be reconfigured for 64 or 32 parallel 4b inputs depending on the needs of the CNN model. The AND8T SRAM macro is fabricated in a 65 nm node and achieves an energy efficiency of 301.08 TOPS/W for 16 parallel neuron outputs, with 128 4b MAC operations at a 10 MHz clock frequency and a 1 V supply. The implemented macro supports clock frequencies up to 100 MHz and occupies 0.124 mm² of chip area while achieving 96.05% and 87% classification accuracy on the MNIST and CIFAR-10 datasets, respectively.
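As a functional illustration of the arithmetic the abstract describes, the sketch below models one CIM column (a single neuron): a 128-element dot product of 4b unsigned inputs and 4b unsigned weights, with the full-precision accumulation quantized to a 4b output code. This is a hypothetical software model of the computation only, not the paper's circuit; the uniform full-scale quantization step is an assumption for illustration.

```python
def mac_column(inputs, weights, out_bits=4):
    """Model one CIM column: 128-element MAC with 4b operands,
    quantized to a 4b output code (illustrative model only)."""
    assert len(inputs) == len(weights) == 128
    assert all(0 <= x < 16 for x in inputs)   # 4b unsigned inputs
    assert all(0 <= w < 16 for w in weights)  # 4b unsigned weights
    # Full-precision accumulation; range is 0 .. 128*15*15 = 28800.
    acc = sum(x * w for x, w in zip(inputs, weights))
    # Uniform quantization of the accumulated value to a 4b code
    # (assumed full-scale mapping, for illustration).
    full_scale = 128 * 15 * 15
    code = min((acc * (2**out_bits - 1)) // full_scale, 2**out_bits - 1)
    return acc, code

acc, code = mac_column([5] * 128, [3] * 128)
print(acc, code)  # → 1920 1
```

Reconfiguring the macro for 64 or 32 parallel inputs would correspond to shortening the vectors (and the full-scale range) accordingly.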
ISSN: 2156-3357, 2156-3365
DOI: 10.1109/JETCAS.2022.3168571