A High Performance Multi-Bit-Width Booth Vector Systolic Accelerator for NAS Optimized Deep Learning Neural Networks
Published in: | IEEE Transactions on Circuits and Systems I: Regular Papers, 2022-09, Vol. 69 (9), pp. 3619-3631 |
Main authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | Multi-bit-width convolutional neural networks (CNNs) balance network accuracy against hardware efficiency, making them a promising approach to accurate yet energy-efficient edge computing. In this work, we develop a state-of-the-art multi-bit-width accelerator for NAS-optimized deep learning neural networks. To process multi-bit-width network inference efficiently, multi-level optimizations are proposed. Firstly, a differential Neural Architecture Search (NAS) method is adopted to generate high-accuracy multi-bit-width networks. Secondly, a hybrid Booth-based multi-bit-width multiply-add-accumulation (MAC) unit is developed for data processing. Thirdly, a vector systolic array is proposed to accelerate matrix multiplications effectively. With the vector-style systolic dataflow, both processing time and logic resource consumption are reduced compared with the classical systolic array. Finally, the proposed multi-bit-width CNN acceleration scheme has been deployed on the Xilinx ZCU102 FPGA platform. Average performance on the full NAS-optimized VGG16 network is 784.2 GOPS, and peak performance of the convolutional layers reaches 871.26 GOPS for INT8, 1676.96 GOPS for INT4, and 2863.29 GOPS for INT2, which is among the best results reported in CNN accelerator benchmarks. |
ISSN: | 1549-8328, 1558-0806 |
DOI: | 10.1109/TCSI.2022.3178474 |
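As background for the abstract above: radix-4 (modified) Booth recoding halves the number of partial products in a signed multiplier, and hybrid Booth-based multi-bit-width MAC units build on that recoding so one datapath can serve several operand widths. The following is a minimal software sketch of the recoding itself, written for clarity rather than as the paper's hardware design; the function name, signature, and 8-bit default width are assumptions.

```python
def booth_radix4_multiply(x: int, y: int, n: int = 8) -> int:
    """Multiply x by an n-bit signed integer y using radix-4 (modified) Booth recoding.

    Illustrative sketch only: the paper's hybrid Booth MAC is a hardware unit that
    reuses its partial-product logic across INT8/INT4/INT2 operands; this function
    shows only the classic recoding such designs start from. The name, signature,
    and default width are assumptions, not the authors' interface.
    """
    y &= (1 << n) - 1          # view y as an n-bit two's-complement pattern
    acc = 0
    prev = 0                   # implicit bit to the right of y[0]
    for i in range(0, n, 2):
        # Overlapping triplet {y[i+1], y[i], y[i-1]} selects one Booth digit.
        triplet = (((y >> i) & 0b11) << 1) | prev
        prev = (y >> (i + 1)) & 1
        # Radix-4 Booth digit in {-2, -1, 0, +1, +2}.
        digit = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                 0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}[triplet]
        acc += (digit * x) << i        # partial product weighted by 4**(i // 2)
    return acc

# Quick check against ordinary multiplication for small signed operands.
assert booth_radix4_multiply(3, 5) == 15
assert booth_radix4_multiply(3, -5) == -15
assert booth_radix4_multiply(-7, -128, n=8) == 896
```

Each Booth digit maps to a simple select, shift, and optional negate of the multiplicand, which is broadly what makes Booth-based partial-product arrays attractive for multi-precision fusion; the paper's specific INT8/INT4/INT2 fusion scheme is described in the full text.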