DCP-CNN: Efficient Acceleration of CNNs With Dynamic Computing Parallelism on FPGA

Bibliographic Details
Published in:	IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024-07, p.1-1
Main authors:	Dai, Kui; Xie, Zheren; Liu, Shuanglong
Format:	Article
Language:	English
Subjects:	Computer architecture; Convolution; Convolutional codes; Convolutional neural networks; Convolutional Neural Networks (CNNs); Field programmable gate arrays; Field Programmable Gate Arrays (FPGAs); Hardware Accelerator; Kernel; Loop Unrolling; Parallel Computing; Parallel processing

Description
Abstract:	Convolutional Neural Networks (CNNs) have demonstrated outstanding accuracy across a range of machine learning tasks. However, their huge computational overhead limits their deployability in real-time applications. For this reason, parallel computing has been extensively employed to accelerate CNNs on parallel computing devices such as GPUs and FPGAs, by unrolling multiple loop operations of convolutional layers. Nevertheless, existing CNN accelerators can hardly exploit the different parallelisms offered by CNN algorithms efficiently, since their degrees of parallelism are fixed across dimensions and layers. In this paper, we propose DCP-CNN, an FPGA-based CNN accelerator which implements the CNN with Dynamic Computing Parallelism degrees. DCP-CNN employs a parallel computing architecture which dynamically allocates the computing resources between different data dimensions of each layer based on layer size, to ensure that all computing units are working to full capacity and thus achieve optimal compute efficiency. Furthermore, in order to boost throughput, we propose a design space exploration (DSE) framework based on the simulated annealing method, which automatically generates the parallelism degrees between different dimensions of the network layers, according to the resource constraints and CNN structure. On an Intel Stratix 10 GX650 FPGA, the proposed DCP-CNN achieves a throughput of more than 800 Gop/s and a compute efficiency of 72% to 98%, which outperforms the existing state-of-the-art FPGA-based CNN accelerators.
ISSN:	0278-0070 1937-4151
DOI:	10.1109/TCAD.2024.3435996
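
Illustrative note: the abstract describes choosing per-layer parallelism degrees with simulated annealing so that computing units are not left idle. The sketch below is not the authors' DSE framework; it is a minimal toy model under stated assumptions. The layer shapes, the 256-unit budget, the three unrolled dimensions (input channels, output channels, output width), the cycle model, and all function names are hypothetical and chosen only to show why a fixed unroll factor can waste units on some layers and how an annealing search can redistribute them.

```python
import math
import random

# Hypothetical toy layers: (input channels C, output channels M, output width W).
LAYERS = [(3, 64, 224), (64, 128, 112), (256, 256, 56), (512, 512, 14)]
PE_BUDGET = 256  # assumed number of multiply-accumulate units


def cycles(layer, p):
    """Cycles for the toy loop nest when each dimension is unrolled by p."""
    return math.prod(math.ceil(d / u) for d, u in zip(layer, p))


def efficiency(layer, p):
    """Fraction of unit-cycles doing useful work (1.0 means no idle units)."""
    return math.prod(layer) / (cycles(layer, p) * math.prod(p))


def neighbour(p, rng):
    """Move a factor of 2 between two dimensions; the unit count is unchanged."""
    src, dst = rng.sample(range(3), 2)
    if p[src] % 2:  # cannot halve an odd factor, so stay in place
        return p
    q = list(p)
    q[src] //= 2
    q[dst] *= 2
    return tuple(q)


def anneal(layer, start, steps=2000, t0=0.1, seed=0):
    """Simulated annealing over unroll factors, minimising the cycle count."""
    rng = random.Random(seed)
    cur = best = start
    for i in range(steps):
        temp = t0 * (1 - i / steps) + 1e-9  # linear cooling schedule
        cand = neighbour(cur, rng)
        rel = (cycles(layer, cand) - cycles(layer, cur)) / cycles(layer, cur)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if rel <= 0 or rng.random() < math.exp(-rel / temp):
            cur = cand
        if cycles(layer, cur) < cycles(layer, best):
            best = cur
    return best


FIXED = (4, 8, 8)  # one fixed configuration using all 256 assumed units
for layer in LAYERS:
    tuned = anneal(layer, FIXED)
    print(layer, "fixed:", round(efficiency(layer, FIXED), 3),
          "tuned:", tuned, round(efficiency(layer, tuned), 3))
```

In this toy model the fixed configuration unrolls input channels by 4, so a 3-channel first layer leaves a quarter of the units idle; letting the factors differ per layer removes that kind of mismatch. A real design space exploration would also have to account for memory bandwidth, on-chip buffers, and FPGA resource limits, none of which are modelled here.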