A hardware-efficient computing engine for FPGA-based deep convolutional neural network accelerator
Published in: Microelectronics 2022-10, Vol. 128, p. 105547, Article 105547
Format: Article
Language: English
Online access: Full text
Abstract: Deep convolutional neural networks (DCNNs) have recently emerged as a promising approach for computer vision tasks, with many new DCNN architectures proposed to further improve their performance. However, their significant computation workload limits the deployment of such networks on embedded devices. Research on accelerating DCNN inference usually targets field-programmable gate arrays (FPGAs) because of their programmability; however, hardware efficiency and reconfigurability often do not receive sufficient attention. This paper proposes an efficient accelerator that supports multiple DCNNs and improves hardware utilization from three perspectives. First, a bandwidth-based tiling algorithm improves the data transfer efficiency of direct memory access (DMA). Second, three parallel strategies improve the utilization of the computing units (CUs). Third, a configurable CU is designed to improve digital signal processor (DSP) utilization. The proposed accelerator is implemented on the Xilinx ZYNQ-7 ZC706 Evaluation Board at 200 MHz. It reaches 163 Giga Operations Per Second (GOPS) and 0.36 GOPS/DSP on VGG-16 while consuming only 448 DSPs; it achieves 0.24 GOPS/DSP on ResNet50 and 0.27 GOPS/DSP on YOLOv2-tiny. The experimental results demonstrate that this design achieves a better trade-off between hardware resource consumption, performance, and reconfigurability than previous works.
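The abstract's headline efficiency number can be checked directly from the figures it reports: GOPS/DSP is simply the measured throughput divided by the number of DSP slices consumed. A minimal sketch of that arithmetic, using only the values given in the abstract:

```python
# Sanity-check the reported efficiency figure for VGG-16:
# efficiency (GOPS/DSP) = throughput (GOPS) / DSP slices used.
throughput_gops = 163   # reported VGG-16 throughput
dsp_slices = 448        # reported DSP consumption

efficiency = throughput_gops / dsp_slices
print(f"{efficiency:.2f} GOPS/DSP")  # prints "0.36 GOPS/DSP", matching the abstract
```

By the same relation, the ResNet50 and YOLOv2-tiny figures (0.24 and 0.27 GOPS/DSP) imply lower effective throughput on those networks at the same DSP budget, which is the usual cost of reconfigurability across differing layer shapes.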
ISSN: 1879-2391
DOI: | 10.1016/j.mejo.2022.105547 |