DCP-CNN: Efficient Acceleration of CNNs With Dynamic Computing Parallelism on FPGA

Bibliographic details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024-07, p. 1-1
Main authors: Dai, Kui; Xie, Zheren; Liu, Shuanglong
Format: Article
Language: English
Subjects:
Online access: Order full text
container_end_page 1
container_issue
container_start_page 1
container_title IEEE transactions on computer-aided design of integrated circuits and systems
container_volume
creator Dai, Kui
Xie, Zheren
Liu, Shuanglong
description Convolutional Neural Networks (CNNs) have demonstrated outstanding accuracy across a range of machine learning tasks. However, their huge computational overhead limits their deployability in real-time applications. For this reason, parallel computing has been extensively employed to accelerate CNNs on parallel devices such as GPUs and FPGAs by unrolling multiple loop operations of the convolutional layers. Nevertheless, existing CNN accelerators can hardly exploit the different parallelisms offered by CNN algorithms efficiently, since their degrees of parallelism are fixed across dimensions and layers. In this paper, we propose DCP-CNN, an FPGA-based CNN accelerator that implements the CNN with Dynamic Computing Parallelism (DCP) degrees. DCP-CNN employs a parallel computing architecture that dynamically allocates computing resources among the data dimensions of each layer according to the layer size, so that all computing units operate at full capacity and compute efficiency is maximized. Furthermore, to boost throughput, we propose a design space exploration (DSE) framework based on simulated annealing, which automatically generates the parallelism degrees for the different dimensions of the network layers according to the resource constraints and the CNN structure. On an Intel Stratix 10 GX650 FPGA, the proposed DCP-CNN achieves a throughput of more than 800 Gop/s and a compute efficiency of 72% to 98%, outperforming existing state-of-the-art FPGA-based CNN accelerators.
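The description above outlines two coupled mechanisms: per-layer parallelism degrees spread over different data dimensions, and a simulated-annealing DSE that picks those degrees under resource constraints. Below is a minimal sketch of what such a search loop can look like; the layer shapes (LAYERS), the latency and DSP models (latency, dsps), the candidate unroll factors, and DSP_BUDGET are hypothetical placeholders for illustration, not the cost model, parameters, or code used in the paper.

```python
"""Minimal simulated-annealing DSE sketch: choose per-layer unroll factors
(p_in, p_out) that minimize a toy latency model under a DSP budget.
All models and constants below are illustrative assumptions."""
import math
import random

# Hypothetical layer shapes: (input channels, output channels, output pixels).
LAYERS = [(3, 64, 224 * 224), (64, 128, 112 * 112), (128, 256, 56 * 56)]
DSP_BUDGET = 1024                              # assumed multiplier budget
CANDIDATE_FACTORS = [1, 2, 4, 8, 16, 32, 64]   # allowed unroll factors


def latency(cfg):
    """Total cycles over all layers for per-layer unrolling (p_in, p_out)."""
    total = 0
    for (cin, cout, pixels), (p_in, p_out) in zip(LAYERS, cfg):
        total += pixels * math.ceil(cin / p_in) * math.ceil(cout / p_out)
    return total


def dsps(cfg):
    """Peak multiplier usage; a single reused engine sized for the largest layer."""
    return max(p_in * p_out for p_in, p_out in cfg)


def neighbour(cfg):
    """Perturb one layer's unroll factor in one dimension."""
    cfg = [list(p) for p in cfg]
    layer = random.randrange(len(cfg))
    cfg[layer][random.randrange(2)] = random.choice(CANDIDATE_FACTORS)
    return [tuple(p) for p in cfg]


def anneal(steps=20000, t_start=1e9, t_end=1.0):
    cur = best = [(1, 1)] * len(LAYERS)        # trivially feasible start point
    for i in range(steps):
        t = t_start * (t_end / t_start) ** (i / steps)   # geometric cooling
        cand = neighbour(cur)
        if dsps(cand) > DSP_BUDGET:            # reject over-budget configurations
            continue
        delta = latency(cand) - latency(cur)
        if delta < 0 or random.random() < math.exp(-delta / t):
            cur = cand
            if latency(cur) < latency(best):
                best = cur
    return best


if __name__ == "__main__":
    random.seed(0)
    best = anneal()
    print("per-layer (p_in, p_out):", best,
          "| cycles:", latency(best), "| DSPs:", dsps(best))
```

The geometric cooling schedule and single-factor perturbation are common simulated-annealing choices for this kind of search; a realistic DSE would additionally model on-chip memory and bandwidth constraints alongside the DSP budget.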
doi_str_mv 10.1109/TCAD.2024.3435996
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 0278-0070
ispartof IEEE transactions on computer-aided design of integrated circuits and systems, 2024-07, p.1-1
issn 0278-0070
1937-4151
language eng
recordid cdi_crossref_primary_10_1109_TCAD_2024_3435996
source IEEE Xplore
subjects Computer architecture
Convolution
Convolutional codes
Convolutional neural networks
Convolutional Neural Networks (CNNs)
Field programmable gate arrays
Field Programmable Gate Arrays (FPGAs)
Hardware Accelerator
Kernel
Loop Unrolling
Parallel Computing
Parallel processing
title DCP-CNN: Efficient Acceleration of CNNs With Dynamic Computing Parallelism on FPGA