A Floating-Point 6T SRAM In-Memory-Compute Macro Using Hybrid-Domain Structure for Advanced AI Edge Chips


Bibliographic Details
Published in: IEEE Journal of Solid-State Circuits, 2024-01, Vol. 59 (1), p. 196-207
Main Authors: Wu, Ping-Chun, Su, Jian-Wei, Hong, Li-Yang, Ren, Jin-Sheng, Chien, Chih-Han, Chen, Ho-Yu, Ke, Chao-En, Hsiao, Hsu-Ming, Li, Sih-Han, Sheu, Shyh-Shyuan, Lo, Wei-Chung, Chang, Shih-Chieh, Lo, Chung-Chuan, Liu, Ren-Shuo, Hsieh, Chih-Cheng, Tang, Kea-Tiong, Chang, Meng-Fan
Format: Article
Language: English
Description
Summary: Advanced artificial intelligence edge devices are expected to support floating-point (FP) multiply-and-accumulate operations while ensuring high energy efficiency and high inference accuracy. This work presents an FP compute-in-memory (CIM) macro that exploits the advantages of computing in the time, digital, and analog-voltage domains for high energy efficiency and accuracy. This work employs: 1) a hybrid-domain macrostructure to enable the computation of both the exponent and mantissa within the same CIM macro; 2) a time-domain computing scheme for energy-efficient exponent computation; 3) a product-exponent-based input-mantissa alignment scheme to enable the accumulation of the product mantissa in the same column; and 4) a place-value-dependent digital-analog-hybrid computing scheme to enable energy-efficient mantissa computations of sufficient accuracy. A 22-nm 832-kB FP-CIM macro fabricated using foundry-provided compact 6T static random access memory (SRAM) cells achieved a high energy efficiency of 72.14 tera-floating-point operations per second (TFLOPS)/W while performing FP multiply-and-accumulate (MAC) operations involving BF16 inputs, BF16 weights, FP32 outputs, and 128 accumulations.
ISSN: 0018-9200, 1558-173X
DOI: 10.1109/JSSC.2023.3309966
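
The summary names a product-exponent-based input-mantissa alignment scheme but does not spell out its arithmetic. The following is a minimal software sketch of the general idea only, not the macro's circuit: each BF16 input/weight pair gets a product exponent (the sum of the two exponents), the largest product exponent becomes the shared reference, and each input mantissa is right-shifted by its gap to that reference so that all product mantissas can be summed as plain integers. The function names (`to_bf16_fields`, `aligned_mac`), the `guard_bits` parameter, truncation instead of rounding, and the flushing of subnormals are all assumptions made for illustration; infinities and NaNs are not handled.

```python
import struct


def to_bf16_fields(x):
    """Split a float into BF16 fields (sign, biased exponent, 8-bit significand).

    Assumption: BF16 is obtained by truncating the FP32 encoding to its
    upper 16 bits; subnormals are flushed to zero.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    sign = bits >> 15
    exp = (bits >> 7) & 0xFF
    frac = bits & 0x7F
    # Normal numbers carry an implicit leading 1 above the 7 stored bits.
    sig = (0x80 | frac) if exp != 0 else 0
    return sign, exp, sig


def aligned_mac(inputs, weights, guard_bits=8):
    """Dot product using a shared (maximum) product exponent.

    Input significands are shifted right by the gap between the maximum
    product exponent and their own product exponent, then multiplied by
    the weight significand and accumulated as integers.
    """
    ins = [to_bf16_fields(v) for v in inputs]
    ws = [to_bf16_fields(v) for v in weights]
    # Product exponent = input exponent + weight exponent (bias counted twice).
    pexps = [i[1] + w[1] for i, w in zip(ins, ws)]
    live = [pe for pe, i, w in zip(pexps, ins, ws) if i[2] and w[2]]
    if not live:
        return 0.0
    max_pe = max(live)

    acc = 0
    for (si, _, mi), (sw, _, mw), pe in zip(ins, ws, pexps):
        if mi == 0 or mw == 0:
            continue  # zero products contribute nothing
        shift = max_pe - pe
        # Alignment step: bits shifted below the guard bits are lost.
        aligned = (mi << guard_bits) >> shift
        prod = aligned * mw
        acc += -prod if si ^ sw else prod

    # Undo the two exponent biases (2 * 127), the two 7-bit fraction
    # scalings (14), and the guard bits.
    return acc * 2.0 ** (max_pe - 2 * 127 - 14 - guard_bits)


print(aligned_mac([1.5, -2.0, 0.25], [2.0, 0.5, 4.0]))  # 3.0
```

With operands that are exactly representable in BF16 and close in magnitude, the sketch reproduces the exact sum of products. When one product is much smaller than the largest, its aligned mantissa is shifted out entirely, which illustrates the accuracy trade-off that any shared-exponent accumulation has to manage.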