DCP-CNN: Efficient Acceleration of CNNs With Dynamic Computing Parallelism on FPGA

Bibliographic details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024-07, p. 1-1
Main authors: Dai, Kui; Xie, Zheren; Liu, Shuanglong
Format: Article
Language: English
Subjects:
Online access: Order full text
container_end_page 1
container_issue
container_start_page 1
container_title IEEE transactions on computer-aided design of integrated circuits and systems
container_volume
creator Dai, Kui
Xie, Zheren
Liu, Shuanglong
description Convolutional Neural Networks (CNNs) have demonstrated outstanding accuracy across a range of machine learning tasks. However, their huge computational overhead limits their deployability in real-time applications. For this reason, parallel computing has been extensively employed to accelerate CNNs on parallel devices such as GPUs and FPGAs by unrolling multiple loop operations of the convolutional layers. Nevertheless, existing CNN accelerators can hardly exploit the different parallelisms offered by CNN algorithms efficiently, since their degrees of parallelism are fixed across dimensions and layers. In this paper, we propose DCP-CNN, an FPGA-based CNN accelerator that implements the CNN with Dynamic Computing Parallelism (DCP) degrees. DCP-CNN employs a parallel computing architecture that dynamically allocates computing resources among the data dimensions of each layer according to the layer size, so that all computing units operate at full capacity and compute efficiency is maximized. Furthermore, to boost throughput, we propose a design space exploration (DSE) framework based on simulated annealing, which automatically generates the parallelism degrees for the different dimensions of the network layers according to the resource constraints and the CNN structure. On an Intel Stratix 10 GX650 FPGA, the proposed DCP-CNN achieves a throughput of more than 800 Gop/s and a compute efficiency of 72% to 98%, outperforming existing state-of-the-art FPGA-based CNN accelerators.
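The description above outlines two coupled mechanisms: per-layer parallelism degrees spread over different data dimensions, and a simulated-annealing DSE that picks those degrees under resource constraints. Below is a minimal sketch of what such a search loop can look like; the layer shapes (LAYERS), the latency and DSP models (latency, dsps), the candidate unroll factors, and DSP_BUDGET are hypothetical placeholders for illustration, not the cost model, parameters, or code used in the paper.

```python
"""Minimal simulated-annealing DSE sketch: choose per-layer unroll factors
(p_in, p_out) that minimize a toy latency model under a DSP budget.
All models and constants below are illustrative assumptions."""
import math
import random

# Hypothetical layer shapes: (input channels, output channels, output pixels).
LAYERS = [(3, 64, 224 * 224), (64, 128, 112 * 112), (128, 256, 56 * 56)]
DSP_BUDGET = 1024                              # assumed multiplier budget
CANDIDATE_FACTORS = [1, 2, 4, 8, 16, 32, 64]   # allowed unroll factors


def latency(cfg):
    """Total cycles over all layers for per-layer unrolling (p_in, p_out)."""
    total = 0
    for (cin, cout, pixels), (p_in, p_out) in zip(LAYERS, cfg):
        total += pixels * math.ceil(cin / p_in) * math.ceil(cout / p_out)
    return total


def dsps(cfg):
    """Peak multiplier usage; a single reused engine sized for the largest layer."""
    return max(p_in * p_out for p_in, p_out in cfg)


def neighbour(cfg):
    """Perturb one layer's unroll factor in one dimension."""
    cfg = [list(p) for p in cfg]
    layer = random.randrange(len(cfg))
    cfg[layer][random.randrange(2)] = random.choice(CANDIDATE_FACTORS)
    return [tuple(p) for p in cfg]


def anneal(steps=20000, t_start=1e9, t_end=1.0):
    cur = best = [(1, 1)] * len(LAYERS)        # trivially feasible start point
    for i in range(steps):
        t = t_start * (t_end / t_start) ** (i / steps)   # geometric cooling
        cand = neighbour(cur)
        if dsps(cand) > DSP_BUDGET:            # reject over-budget configurations
            continue
        delta = latency(cand) - latency(cur)
        if delta < 0 or random.random() < math.exp(-delta / t):
            cur = cand
            if latency(cur) < latency(best):
                best = cur
    return best


if __name__ == "__main__":
    random.seed(0)
    best = anneal()
    print("per-layer (p_in, p_out):", best,
          "| cycles:", latency(best), "| DSPs:", dsps(best))
```

The geometric cooling schedule and single-factor perturbation are common simulated-annealing choices for this kind of search; a realistic DSE would additionally model on-chip memory and bandwidth constraints alongside the DSP budget.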
doi_str_mv 10.1109/TCAD.2024.3435996
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 0278-0070
ispartof IEEE transactions on computer-aided design of integrated circuits and systems, 2024-07, p.1-1
issn 0278-0070
1937-4151
language eng
recordid cdi_crossref_primary_10_1109_TCAD_2024_3435996
source IEEE Xplore
subjects Computer architecture
Convolution
Convolutional codes
Convolutional neural networks
Convolutional Neural Networks (CNNs)
Field programmable gate arrays
Field Programmable Gate Arrays (FPGAs)
Hardware Accelerator
Kernel
Loop Unrolling
Parallel Computing
Parallel processing
title DCP-CNN: Efficient Acceleration of CNNs With Dynamic Computing Parallelism on FPGA