BitBlade: Energy-Efficient Variable Bit-Precision Hardware Accelerator for Quantized Neural Networks

Bibliographic Details
Published in: IEEE Journal of Solid-State Circuits, June 2022, Vol. 57 (6), pp. 1924-1935
Authors: Ryu, Sungju; Kim, Hyungjun; Yi, Wooseok; Kim, Eunhwan; Kim, Yulhwa; Kim, Taesu; Kim, Jae-Joon
Format: Article
Language: English
Abstract: We introduce an area/energy-efficient precision-scalable neural network accelerator architecture. Previous precision-scalable hardware accelerators have limitations such as the under-utilization of multipliers for low bit-width operations and the large area overhead needed to support various bit precisions. To mitigate these problems, we first propose a bitwise summation, which reduces the area overhead of bit-width scaling. In addition, we present a channel-wise aligning scheme (CAS) to efficiently fetch inputs and weights from on-chip SRAM buffers and a channel-first and pixel-last tiling (CFPL) scheme to maximize the utilization of multipliers across various kernel sizes. A test chip was implemented in 28-nm CMOS technology, and the experimental results show that the throughput and energy efficiency of our chip are up to 7.7× and 1.64× higher than those of state-of-the-art designs, respectively. Moreover, an additional 1.5-3.4× throughput gain can be achieved with the CFPL method compared to the CAS.
ISSN: 0018-9200 (print); 1558-173X (electronic)
DOI: 10.1109/JSSC.2022.3141050
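
Note on the technique (illustrative only): the abstract names a bitwise summation for precision scaling but does not describe its datapath. As a rough sketch of the general principle that precision-scalable multiply-accumulate arrays rely on, the Python example below decomposes a wider multiplication into shifted 2-bit partial products. Every name in the sketch is hypothetical, and how this maps onto BitBlade's actual hardware is an assumption, not the paper's design.

# Illustrative sketch, not the paper's circuit: a high-precision product can be
# decomposed into low-precision partial products that are shifted and summed,
# so the same 2-bit multipliers can serve 2-, 4-, or 8-bit operands.

def split_into_2bit_chunks(value: int, bit_width: int):
    """Split an unsigned integer into little-endian 2-bit chunks."""
    assert 0 <= value < (1 << bit_width)
    return [(value >> s) & 0b11 for s in range(0, bit_width, 2)]

def scalable_multiply(activation: int, weight: int, bit_width: int) -> int:
    """Multiply two unsigned bit_width-bit operands using only 2-bit products."""
    a_chunks = split_into_2bit_chunks(activation, bit_width)
    w_chunks = split_into_2bit_chunks(weight, bit_width)
    total = 0
    for i, a in enumerate(a_chunks):           # a contributes at bit offset 2*i
        for j, w in enumerate(w_chunks):       # w contributes at bit offset 2*j
            total += (a * w) << (2 * (i + j))  # shift-and-add the partial product
    return total

# Sanity check: matches a direct 8-bit multiply.
assert scalable_multiply(0xB7, 0x5C, 8) == 0xB7 * 0x5C

In hardware, the shifters and adders that combine these partial products are a major source of the area overhead mentioned in the abstract; the paper's claim is that its bitwise summation reduces exactly this overhead when scaling the bit width.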