An Emerging NVM CIM Accelerator With Shared-Path Transpose Read and Bit-Interleaving Weight Storage for Efficient On-Chip Training in Edge Devices

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems II: Express Briefs, 2023-07, Vol. 70 (7), pp. 2645-2649
Authors: Guo, Zhiwang; Chen, Deyang; Zhao, Chenyang; Fang, Jinbei; Jiang, Jingwen; Liu, Yixuan; Tian, Haidong; Xiong, Xiankui; Zhou, Keji; Xue, Xiaoyong; Liu, Qi; Zeng, Xiaoyang
Format: Article
Language: English
Abstract: Computing-in-memory (CIM) helps improve the energy efficiency of computing by reducing data movement. In edge devices, CIM accelerators need to support lightweight on-chip training to adapt the model to environmental changes and to keep edge data secure. However, most previous CIM accelerators for edge devices realize only inference, with training performed on the cloud, and adding support for on-chip training typically incurs remarkable area cost and serious performance degradation. In this brief, a CIM accelerator based on emerging nonvolatile memory (NVM) is presented with shared-path transpose read and bit-interleaving weight storage for efficient on-chip training in edge devices. The shared-path transpose read employs a new biasing scheme that eliminates the influence of the body effect on the transpose read, improving both read margin and speed. The bit-interleaving weight storage splits multi-bit weights into individual bits that are stored alternately in the array, markedly speeding up the computation of the training process. For 8-bit inputs and weights, evaluation in a 28 nm process shows that the proposed accelerator achieves ~3.34/3.06 TOPS/W energy efficiency for feed-forward/back-propagation, 4.6X lower computing latency, and at least 20% smaller chip size compared to the baseline design.
ISSN: 1549-7747
1558-3791
DOI: 10.1109/TCSII.2023.3240193
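
The bit-interleaving weight storage described in the abstract splits each multi-bit weight into individual bits that are stored alternately in the array. As a rough illustration only (the NumPy-based layout, function names, and LSB-first ordering below are assumptions for the sketch, not the paper's actual array mapping), a minimal software model of such bit-plane interleaving and its inverse:

import numpy as np

def interleave_weight_bits(weights, n_bits=8):
    # Split signed n_bits weights into bit planes and interleave them
    # column-wise, so bit k of every weight lands in columns k, k+n_bits, ...
    # (Illustrative layout only; the paper's array mapping may differ.)
    w = np.asarray(weights, dtype=np.int64)
    unsigned = w & ((1 << n_bits) - 1)                      # two's-complement view
    rows, cols = unsigned.shape
    out = np.empty((rows, cols * n_bits), dtype=np.uint8)
    for k in range(n_bits):
        out[:, k::n_bits] = (unsigned >> k) & 1             # LSB-first bit planes
    return out

def deinterleave_weight_bits(bit_array, n_bits=8):
    # Reassemble signed weights from the interleaved bit planes.
    rows, total = bit_array.shape
    cols = total // n_bits
    unsigned = np.zeros((rows, cols), dtype=np.int64)
    for k in range(n_bits):
        unsigned |= bit_array[:, k::n_bits].astype(np.int64) << k
    sign = 1 << (n_bits - 1)
    return (unsigned ^ sign) - sign                         # restore two's-complement sign

# Round-trip check on random signed 8-bit weights
w = np.random.randint(-128, 128, size=(4, 3))
assert np.array_equal(deinterleave_weight_bits(interleave_weight_bits(w)), w)

The round-trip check confirms that interleaving followed by de-interleaving recovers the original signed 8-bit weights; in the accelerator itself the analogous mapping would be realized in the NVM array layout rather than in software, which is what allows the individual weight bits to be accessed in parallel during training.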