Hardware-Friendly 3-D CNN Acceleration With Balanced Kernel Group Sparsity
Being capable of extracting more information than 2-D convolutional neural networks (CNNs), 3-D CNNs play a vital role in video analysis tasks such as human action recognition, but their massive operation counts hinder real-time execution on edge devices with constrained computation and memory resources. Although various model compression techniques have been applied to accelerate 2-D CNNs, there has been little effort to investigate hardware-friendly pruning of 3-D CNNs and their acceleration on customizable edge platforms such as FPGAs. This work first proposes a kernel group row-column (KGRC) weight sparsity pattern, which is fine-grained enough to achieve high pruning ratios with negligible accuracy loss and balanced across kernel groups to achieve high computation parallelism on hardware. A reweighted pruning algorithm for this sparsity pattern is then presented and applied to 3-D CNNs, followed by quantization under different precisions. Alongside model compression, FPGA-based accelerators with four modes are designed to support the kernel group sparsity in multiple dimensions. The co-design framework of the pruning algorithm and the accelerator is tested on two representative 3-D CNNs, C3D and R(2+1)D, on the Xilinx ZCU102 FPGA platform for action recognition. The experimental results indicate that the accelerator implementation with KGRC sparsity and 8-bit quantization achieves a good balance between speedup and model accuracy, yielding acceleration ratios of 4.12× for C3D and 3.85× for R(2+1)D compared with 16-bit baseline designs supporting only dense models.
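The abstract does not spell out the exact KGRC pattern or the reweighted pruning algorithm, so the following is only a minimal NumPy sketch of the general idea it describes: within each group of output kernels, prune by magnitude while keeping the same number of kernel (row, column) positions in every group, so that the surviving non-zeros are balanced across groups and map evenly onto parallel compute units. The function name, the grouping along output channels, and the L2 group-magnitude criterion are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (assumptions noted above), not the paper's KGRC algorithm.
import numpy as np

def balanced_kernel_group_mask(weight, group_size=4, keep_per_group=4):
    """weight: (C_out, C_in, K_t, K_h, K_w) 3-D convolution kernel tensor.
    Keeps the same number of (row, column) positions in every kernel group,
    so each group carries an equal workload on parallel hardware."""
    c_out, c_in, k_t, k_h, k_w = weight.shape
    assert c_out % group_size == 0, "C_out must divide evenly into kernel groups"
    mask = np.zeros_like(weight)
    for g in range(c_out // group_size):
        grp = weight[g * group_size:(g + 1) * group_size]     # (group_size, C_in, K_t, K_h, K_w)
        # Score each (row, column) position by its L2 magnitude over the whole group.
        scores = np.sqrt((grp ** 2).sum(axis=(0, 1, 2)))       # (K_h, K_w)
        keep = np.argsort(scores.ravel())[-keep_per_group:]    # top-k positions, same k for every group
        pos_mask = np.zeros(k_h * k_w)
        pos_mask[keep] = 1.0
        mask[g * group_size:(g + 1) * group_size] = pos_mask.reshape(1, 1, 1, k_h, k_w)
    return mask

# Example: a C3D-style layer with 64 output channels, 32 input channels, 3x3x3 kernels.
w = np.random.randn(64, 32, 3, 3, 3)
m = balanced_kernel_group_mask(w, group_size=4, keep_per_group=4)
print("kept weight fraction:", m.mean())  # 4/9 of positions retained in every group
```

Because every group retains exactly `keep_per_group` positions, an accelerator can assign one group per processing element without load imbalance, which is the balance property the abstract highlights as the source of high computation parallelism.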
Published in: | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024-10, Vol. 43 (10), p. 3027-3040 |
---|---|
Main authors: | Sun, Mengshu; Xu, Kaidi; Lin, Xue; Hu, Yongli; Yin, Baocai |
Format: | Article |
Language: | English |
Subjects: | 3-D convolutional neural network (CNN); Computational modeling; Convolutional neural networks; edge device inference; Field programmable gate arrays; FPGA; Kernel; model compression; Parallel processing; Quantization (signal); Three-dimensional displays; weight pruning |
DOI: | 10.1109/TCAD.2024.3390040 |
ISSN: | 0278-0070 |
EISSN: | 1937-4151 |
Publisher: | IEEE |