Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA

Bibliographic details
Published in:	IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020-07, Vol. 39 (7), pp. 1442-1455
Main authors:	Shen, Junzhong; Huang, You; Wen, Mei; Zhang, Chunyuan
Format:	Article
Language:	English
Subjects:	3-D convolutional neural networks (3-D CNNs); Acceleration; Accelerators; Algorithms; Artificial neural networks; Clustering; Computer architecture; Computer vision; Convolution; Design optimization; Field programmable gate arrays; Mapping; Space exploration; Three-dimensional displays; Two dimensional displays; Uniform templates; Winograd algorithm; Workload

Description
Abstract:	3-D convolutional neural networks (3-D CNNs) are used efficiently in many computer vision applications. Most previous work in this area has concentrated only on design and optimization of accelerators for 2-D CNNs, with few attempts having been made to accelerate 3-D CNNs on FPGA. We find the acceleration of 3-D CNNs on FPGA to be challenging due to their high computational complexity and storage demands. More importantly, although the computational patterns of 2-D and 3-D CNNs are analogous, the conventional approaches that have been adopted for acceleration of 2-D CNNs may be unfit for 3-D CNN acceleration. In this paper, in order to accelerate 2-D and 3-D CNNs using a uniform framework, we first propose a uniform template-based architecture that uses templates based on the Winograd algorithm to ensure the rapid development of 2-D and 3-D CNN accelerators. Then, with the aim of efficiently mapping all layers of 2-D/3-D CNNs onto a pipelined accelerator, techniques are developed to improve the throughput and computational efficiency of the accelerator, including layer fusion, layer clustering, and a workload-balancing scheme. Finally, we demonstrate the effectiveness of the deep pipelined architecture by accelerating real-life 2-D and 3-D CNNs on the state-of-the-art FPGA platform. On VCU118, we achieve 3.7 TOPS for VGG-16, which outperforms state-of-the-art FPGA-based CNN accelerators. Comparisons with CPU and GPU solutions demonstrate that our implementation of 3-D CNN achieves gains of up to 17.8× and 64.2× in performance and energy relative to a CPU solution, and a 5.0× energy efficiency gain over a GPU solution.
ISSN:	0278-0070 1937-4151
DOI:	10.1109/TCAD.2019.2912894
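
Note on the Winograd algorithm named in the abstract: the following is a minimal illustrative sketch of the textbook 1-D Winograd minimal filtering scheme F(2,3), not the paper's template design or FPGA implementation. It computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 needed by the direct method. The function names and the test values are hypothetical, chosen only for illustration.

```python
def direct_f23(d, g):
    """Direct 3-tap filtering of a 4-element tile: 2 outputs, 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using 4 data-by-filter multiplications."""
    # Filter transform; for fixed weights this can be computed once in advance.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # The four element-wise multiplications on the transformed input tile.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]   # hypothetical input tile
    taps = [0.5, -1.0, 2.0]       # hypothetical filter weights
    print(direct_f23(tile, taps))
    print(winograd_f23(tile, taps))
```

Nesting the same transforms along two or three axes gives the 2-D and 3-D variants; the saving in multiplications is the reason Winograd-style schemes are attractive on DSP-limited hardware such as FPGAs.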