A dedicated hardware accelerator for real-time acceleration of YOLOv2

Bibliographic Details
Published in:	Journal of real-time image processing 2021-06, Vol.18 (3), p.481-492
Main Authors:	Xu, Ke; Wang, Xiaoyun; Liu, Xinyang; Cao, Changfeng; Li, Huolin; Peng, Haiyong; Wang, Dong
Format:	Article
Language:	eng
Subjects:	Acceleration; Accuracy; Algorithms; Artificial neural networks; Bandwidths; Circuits; Computer Graphics; Computer Science; Design; Field programmable gate arrays; Hardware; Image classification; Image Processing and Computer Vision; Multimedia Information Systems; Neural networks; Object recognition; Original Research Paper; Pattern Recognition; Performance evaluation; Pipeline design; Power; Resource utilization; Signal, Image and Speech Processing; Workloads
Online Access:	Full text

Description
Abstract:	In recent years, dedicated hardware accelerators for convolutional neural networks (CNNs) have been extensively studied. Although many studies have presented efficient FPGA designs for image classification models such as AlexNet and VGG, there are still few implementations for CNN-based object detection applications. This paper presents an OpenCL-based high-throughput FPGA accelerator for the YOLOv2 object detection algorithm on an Arria-10 GX1150 FPGA. The proposed hardware architecture adopts a scalable pipeline design to support multi-resolution input images and a full 8-bit fixed-point datapath to improve hardware resource utilization. A layer fusion technique that merges the convolution, batch normalization and Leaky-ReLU layers is also developed to avoid transferring intermediate data between the FPGA and external memory. Experimental results show that the final design achieves a peak throughput of 566 GOP/s at a working frequency of 190 MHz. The accelerator can execute YOLOv2 inference (288 × 288 resolution) and tiny YOLOv2 inference (416 × 416 resolution) at 35 and 71 FPS, respectively.
ISSN:	1861-8200 1861-8219
DOI:	10.1007/s11554-020-00977-w
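
Note on layer fusion: the abstract mentions merging convolution, batch normalization and Leaky-ReLU so that intermediate results need not travel to external memory. The record does not give the paper's OpenCL kernels, so the following is only a minimal Python sketch of the general idea, assuming the standard inference-time batch-normalization formula and a Leaky-ReLU slope of 0.1. All function names and the `eps` value are hypothetical, and floating point is used here instead of the paper's 8-bit fixed-point datapath.

```python
import math

def leaky_relu(x, slope=0.1):
    # Slope 0.1 is an assumption for illustration, not taken from the record.
    return x if x > 0 else slope * x

def conv_dot(weights, bias, patch):
    # One output value of a convolution: dot product of a filter with an input patch.
    return sum(w * p for w, p in zip(weights, patch)) + bias

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    # Inference-time batch normalization with fixed statistics for one channel.
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    # Fold the per-channel scale and shift into the filter weights and bias,
    # so that batch normalization disappears as a separate step.
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta

def unfused(weights, bias, patch, gamma, beta, mean, var):
    # Three separate steps: convolution, batch normalization, activation.
    y = conv_dot(weights, bias, patch)
    return leaky_relu(batch_norm(y, gamma, beta, mean, var))

def fused(folded_weights, folded_bias, patch):
    # A single pass: convolution with folded parameters, then activation.
    return leaky_relu(conv_dot(folded_weights, folded_bias, patch))
```

With parameters folded once ahead of time, `fused` yields the same value as `unfused` up to floating-point rounding, while producing no intermediate tensor between the convolution and the activation.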