HF-HRNet: A Simple Hardware Friendly High-Resolution Network

High-resolution networks have made significant progress in dense prediction tasks such as human pose estimation and semantic segmentation. To better explore this high-resolution mechanism on mobile devices, Lite-HRNet incorporates shuffle operations to reduce computational complexity in the channel...

Bibliographic Details
Published in: IEEE transactions on circuits and systems for video technology, 2024-08, Vol. 34 (8), p. 7699-7711
Main Authors: Zhang, Hao; Dun, Yujie; Pei, Yixuan; Lai, Shenqi; Liu, Chengxu; Zhang, Kaipeng; Qian, Xueming
Format: Article
Language: eng
container_end_page 7711
container_issue 8
container_start_page 7699
container_title IEEE transactions on circuits and systems for video technology
container_volume 34
creator Zhang, Hao
Dun, Yujie
Pei, Yixuan
Lai, Shenqi
Liu, Chengxu
Zhang, Kaipeng
Qian, Xueming
description High-resolution networks have made significant progress in dense prediction tasks such as human pose estimation and semantic segmentation. To better explore this high-resolution mechanism on mobile devices, Lite-HRNet incorporates shuffle operations to reduce computational complexity in the channel dimension, while Dite-HRNet employs dynamic convolution and pooling to capture long-range interactions with low computational complexity in the spatial dimension. The core idea behind both approaches is to efficiently capture information in either the channel or spatial dimension. However, shuffle operations and dynamic operations are not hardware-friendly. As a result, neither Lite-HRNet nor Dite-HRNet achieves the desired inference speed on specialized devices, including Neural Processing Units (NPUs) and Graphics Processing Units (GPUs). To overcome these limitations, we present a simple Hardware-Friendly Lightweight High-resolution Network (HF-HRNet) based on our proposed Hardware-Friendly Uniform-sized Mug (HUM) block. The HUM block mainly consists of the Cascaded Depthwise (CAD) block and the Multi-Scale Context Embedding (MCE) block. The CAD block cascades depthwise convolutions to obtain a larger receptive field in the spatial dimension, while the MCE block aggregates spatial feature information from multiple scales and adjusts channel features. Extensive experiments are conducted on human pose estimation (COCO, MPII) and semantic segmentation (Cityscapes), resulting in a better trade-off between inference speed and accuracy on both NPUs and GPUs. Notably, on the COCO test-dev set, HF-HRNet-30 outperforms Dite-HRNet-30 and Lite-HRNet-30 by 1.9 AP and 2.8 AP, respectively, while running about 13 times faster and 9 times faster on NPUs, respectively. Our code is publicly available at https://github.com/zhanghao5201/HF-HRNet .
doi_str_mv 10.1109/TCSVT.2024.3377365
format Article
fullrecord <record><control><sourceid>crossref_RIE</sourceid><recordid>TN_cdi_crossref_primary_10_1109_TCSVT_2024_3377365</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10472506</ieee_id><sourcerecordid>10_1109_TCSVT_2024_3377365</sourcerecordid><originalsourceid>FETCH-LOGICAL-c268t-c2faa5c105598c4e25614f26dfcef1e014db5cf378f7d596cd4f264273d91c833</originalsourceid><addsrcrecordid>eNpNj89Kw0AQhxdRsFZfQDzkBbbu3-xGvJRgjFAU2ug1rLuzGk2bshspfXsT24OXmYGZb_h9CF1TMqOUZLdVvnqrZowwMeNcKZ7KEzShUmrMGJGnw0wkxZpReY4uYvwihAot1ATdlwUul8_Q3yXzZNWsty0kpQluZwIkRWhg49p9UjYfn3gJsWt_-qbbJMP9rgvfl-jMmzbC1bFP0WvxUOUlXrw8PuXzBbYs1f1QvTHSDhFkpq0AJlMqPEudt-ApDFHcu7SeK-2Vk1lq3bgVTHGXUas5nyJ2-GtDF2MAX29DszZhX1NSj_71n389-tdH_wG6OUANAPwDhGKSpPwX4yRWUg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>HF-HRNet: A Simple Hardware Friendly High-Resolution Network</title><source>IEEE Electronic Library (IEL)</source><creator>Zhang, Hao ; Dun, Yujie ; Pei, Yixuan ; Lai, Shenqi ; Liu, Chengxu ; Zhang, Kaipeng ; Qian, Xueming</creator><creatorcontrib>Zhang, Hao ; Dun, Yujie ; Pei, Yixuan ; Lai, Shenqi ; Liu, Chengxu ; Zhang, Kaipeng ; Qian, Xueming</creatorcontrib><description>High-resolution networks have made significant progress in dense prediction tasks such as human pose estimation and semantic segmentation. To better explore this high-resolution mechanism on mobile devices, Lite-HRNet incorporates shuffle operations to reduce computational complexity in the channel dimension, while Dite-HRNet employs dynamic convolution and pooling to capture long-range interactions with low computational complexity in the spatial dimension. The core idea behind both approaches is to efficiently capture information in either the channel or spatial dimension. However, shuffle operations and dynamic operations are not hardware-friendly. 
As a result, both Lite-HRNet and Dite-HRNet cannot achieve the desired inference speed on specialized devices, including Neural Processing Units (NPUs) and Graphics Processing Units (GPUs). To overcome these limitations, we present a simple Hardware-Friendly Lightweight High-resolution Network (HF-HRNet) based on our proposed Hardware-Friendly Uniform-sized Mug (HUM) block. HUM block mainly consists of the Cascaded Depthwise (CAD) block and Multi-Scale Context Embedding (MCE) block. The CAD block cascades depthwise convolutions to obtain a larger receptive field in the spatial dimension, while the MCE block aggregates multi-scale spatial feature information from different scales and adjusts channel features. Extensive experiments are conducted on human pose estimation (COCO, MPII) and semantic segmentation (Cityscapes), resulting in a better trade-off between inference speed and accuracy on both NPUs and GPUs. It is noteworthy that on the COCO test-dev set, HF-HRNet-30 outperforms Dite-HRNet-30 and Lite-HRNet-30 by 1.9 AP and 2.8 AP, respectively, while running about 13 times faster and 9 times faster on NPUs, respectively. 
Our code are publicly available for use: https://github.com/zhanghao5201/HF-HRNet .</description><identifier>ISSN: 1051-8215</identifier><identifier>EISSN: 1558-2205</identifier><identifier>DOI: 10.1109/TCSVT.2024.3377365</identifier><identifier>CODEN: ITCTEM</identifier><language>eng</language><publisher>IEEE</publisher><subject>Delays ; High-resolution imaging ; high-resolution networks ; Human factors ; Human pose estimation ; Mobile handsets ; Network architecture ; networks ; Pose estimation ; Semantic segmentation ; Solid modeling</subject><ispartof>IEEE transactions on circuits and systems for video technology, 2024-08, Vol.34 (8), p.7699-7711</ispartof><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c268t-c2faa5c105598c4e25614f26dfcef1e014db5cf378f7d596cd4f264273d91c833</citedby><cites>FETCH-LOGICAL-c268t-c2faa5c105598c4e25614f26dfcef1e014db5cf378f7d596cd4f264273d91c833</cites><orcidid>0009-0005-5641-4004 ; 0000-0001-8673-2218 ; 0000-0002-3173-6307 ; 0000-0002-3572-7053 ; 0000-0001-8023-9465 ; 0000-0001-5213-1000</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10472506$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27901,27902,54733</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10472506$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Zhang, Hao</creatorcontrib><creatorcontrib>Dun, Yujie</creatorcontrib><creatorcontrib>Pei, Yixuan</creatorcontrib><creatorcontrib>Lai, Shenqi</creatorcontrib><creatorcontrib>Liu, Chengxu</creatorcontrib><creatorcontrib>Zhang, Kaipeng</creatorcontrib><creatorcontrib>Qian, Xueming</creatorcontrib><title>HF-HRNet: A Simple Hardware Friendly High-Resolution Network</title><title>IEEE transactions on circuits and 
systems for video technology</title><addtitle>TCSVT</addtitle><description>High-resolution networks have made significant progress in dense prediction tasks such as human pose estimation and semantic segmentation. To better explore this high-resolution mechanism on mobile devices, Lite-HRNet incorporates shuffle operations to reduce computational complexity in the channel dimension, while Dite-HRNet employs dynamic convolution and pooling to capture long-range interactions with low computational complexity in the spatial dimension. The core idea behind both approaches is to efficiently capture information in either the channel or spatial dimension. However, shuffle operations and dynamic operations are not hardware-friendly. As a result, both Lite-HRNet and Dite-HRNet cannot achieve the desired inference speed on specialized devices, including Neural Processing Units (NPUs) and Graphics Processing Units (GPUs). To overcome these limitations, we present a simple Hardware-Friendly Lightweight High-resolution Network (HF-HRNet) based on our proposed Hardware-Friendly Uniform-sized Mug (HUM) block. HUM block mainly consists of the Cascaded Depthwise (CAD) block and Multi-Scale Context Embedding (MCE) block. The CAD block cascades depthwise convolutions to obtain a larger receptive field in the spatial dimension, while the MCE block aggregates multi-scale spatial feature information from different scales and adjusts channel features. Extensive experiments are conducted on human pose estimation (COCO, MPII) and semantic segmentation (Cityscapes), resulting in a better trade-off between inference speed and accuracy on both NPUs and GPUs. It is noteworthy that on the COCO test-dev set, HF-HRNet-30 outperforms Dite-HRNet-30 and Lite-HRNet-30 by 1.9 AP and 2.8 AP, respectively, while running about 13 times faster and 9 times faster on NPUs, respectively. 
Our code are publicly available for use: https://github.com/zhanghao5201/HF-HRNet .</description><subject>Delays</subject><subject>High-resolution imaging</subject><subject>high-resolution networks</subject><subject>Human factors</subject><subject>Human pose estimation</subject><subject>Mobile handsets</subject><subject>Network architecture</subject><subject>networks</subject><subject>Pose estimation</subject><subject>Semantic segmentation</subject><subject>Solid modeling</subject><issn>1051-8215</issn><issn>1558-2205</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpNj89Kw0AQhxdRsFZfQDzkBbbu3-xGvJRgjFAU2ug1rLuzGk2bshspfXsT24OXmYGZb_h9CF1TMqOUZLdVvnqrZowwMeNcKZ7KEzShUmrMGJGnw0wkxZpReY4uYvwihAot1ATdlwUul8_Q3yXzZNWsty0kpQluZwIkRWhg49p9UjYfn3gJsWt_-qbbJMP9rgvfl-jMmzbC1bFP0WvxUOUlXrw8PuXzBbYs1f1QvTHSDhFkpq0AJlMqPEudt-ApDFHcu7SeK-2Vk1lq3bgVTHGXUas5nyJ2-GtDF2MAX29DszZhX1NSj_71n389-tdH_wG6OUANAPwDhGKSpPwX4yRWUg</recordid><startdate>20240801</startdate><enddate>20240801</enddate><creator>Zhang, Hao</creator><creator>Dun, Yujie</creator><creator>Pei, Yixuan</creator><creator>Lai, Shenqi</creator><creator>Liu, Chengxu</creator><creator>Zhang, Kaipeng</creator><creator>Qian, Xueming</creator><general>IEEE</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0009-0005-5641-4004</orcidid><orcidid>https://orcid.org/0000-0001-8673-2218</orcidid><orcidid>https://orcid.org/0000-0002-3173-6307</orcidid><orcidid>https://orcid.org/0000-0002-3572-7053</orcidid><orcidid>https://orcid.org/0000-0001-8023-9465</orcidid><orcidid>https://orcid.org/0000-0001-5213-1000</orcidid></search><sort><creationdate>20240801</creationdate><title>HF-HRNet: A Simple Hardware Friendly High-Resolution Network</title><author>Zhang, Hao ; Dun, Yujie ; Pei, Yixuan ; Lai, Shenqi ; Liu, Chengxu ; Zhang, Kaipeng ; Qian, 
Xueming</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c268t-c2faa5c105598c4e25614f26dfcef1e014db5cf378f7d596cd4f264273d91c833</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Delays</topic><topic>High-resolution imaging</topic><topic>high-resolution networks</topic><topic>Human factors</topic><topic>Human pose estimation</topic><topic>Mobile handsets</topic><topic>Network architecture</topic><topic>networks</topic><topic>Pose estimation</topic><topic>Semantic segmentation</topic><topic>Solid modeling</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zhang, Hao</creatorcontrib><creatorcontrib>Dun, Yujie</creatorcontrib><creatorcontrib>Pei, Yixuan</creatorcontrib><creatorcontrib>Lai, Shenqi</creatorcontrib><creatorcontrib>Liu, Chengxu</creatorcontrib><creatorcontrib>Zhang, Kaipeng</creatorcontrib><creatorcontrib>Qian, Xueming</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><jtitle>IEEE transactions on circuits and systems for video technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Zhang, Hao</au><au>Dun, Yujie</au><au>Pei, Yixuan</au><au>Lai, Shenqi</au><au>Liu, Chengxu</au><au>Zhang, Kaipeng</au><au>Qian, Xueming</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>HF-HRNet: A Simple Hardware Friendly High-Resolution Network</atitle><jtitle>IEEE transactions on circuits and systems for video 
technology</jtitle><stitle>TCSVT</stitle><date>2024-08-01</date><risdate>2024</risdate><volume>34</volume><issue>8</issue><spage>7699</spage><epage>7711</epage><pages>7699-7711</pages><issn>1051-8215</issn><eissn>1558-2205</eissn><coden>ITCTEM</coden><abstract>High-resolution networks have made significant progress in dense prediction tasks such as human pose estimation and semantic segmentation. To better explore this high-resolution mechanism on mobile devices, Lite-HRNet incorporates shuffle operations to reduce computational complexity in the channel dimension, while Dite-HRNet employs dynamic convolution and pooling to capture long-range interactions with low computational complexity in the spatial dimension. The core idea behind both approaches is to efficiently capture information in either the channel or spatial dimension. However, shuffle operations and dynamic operations are not hardware-friendly. As a result, both Lite-HRNet and Dite-HRNet cannot achieve the desired inference speed on specialized devices, including Neural Processing Units (NPUs) and Graphics Processing Units (GPUs). To overcome these limitations, we present a simple Hardware-Friendly Lightweight High-resolution Network (HF-HRNet) based on our proposed Hardware-Friendly Uniform-sized Mug (HUM) block. HUM block mainly consists of the Cascaded Depthwise (CAD) block and Multi-Scale Context Embedding (MCE) block. The CAD block cascades depthwise convolutions to obtain a larger receptive field in the spatial dimension, while the MCE block aggregates multi-scale spatial feature information from different scales and adjusts channel features. Extensive experiments are conducted on human pose estimation (COCO, MPII) and semantic segmentation (Cityscapes), resulting in a better trade-off between inference speed and accuracy on both NPUs and GPUs. 
It is noteworthy that on the COCO test-dev set, HF-HRNet-30 outperforms Dite-HRNet-30 and Lite-HRNet-30 by 1.9 AP and 2.8 AP, respectively, while running about 13 times faster and 9 times faster on NPUs, respectively. Our code are publicly available for use: https://github.com/zhanghao5201/HF-HRNet .</abstract><pub>IEEE</pub><doi>10.1109/TCSVT.2024.3377365</doi><tpages>13</tpages><orcidid>https://orcid.org/0009-0005-5641-4004</orcidid><orcidid>https://orcid.org/0000-0001-8673-2218</orcidid><orcidid>https://orcid.org/0000-0002-3173-6307</orcidid><orcidid>https://orcid.org/0000-0002-3572-7053</orcidid><orcidid>https://orcid.org/0000-0001-8023-9465</orcidid><orcidid>https://orcid.org/0000-0001-5213-1000</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1051-8215
ispartof IEEE transactions on circuits and systems for video technology, 2024-08, Vol.34 (8), p.7699-7711
issn 1051-8215
1558-2205
language eng
recordid cdi_crossref_primary_10_1109_TCSVT_2024_3377365
source IEEE Electronic Library (IEL)
subjects Delays
High-resolution imaging
high-resolution networks
Human factors
Human pose estimation
Mobile handsets
Network architecture
networks
Pose estimation
Semantic segmentation
Solid modeling
title HF-HRNet: A Simple Hardware Friendly High-Resolution Network
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T18%3A14%3A37IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=HF-HRNet:%20A%20Simple%20Hardware%20Friendly%20High-Resolution%20Network&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Zhang,%20Hao&rft.date=2024-08-01&rft.volume=34&rft.issue=8&rft.spage=7699&rft.epage=7711&rft.pages=7699-7711&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2024.3377365&rft_dat=%3Ccrossref_RIE%3E10_1109_TCSVT_2024_3377365%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10472506&rfr_iscdi=true
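
Note on the abstract: the description states that the CAD block cascades depthwise convolutions to obtain a larger receptive field. The arithmetic behind that statement can be sketched as below. This is only an illustration under stated assumptions: the 3x3 kernel size, the stride of 1, and the channel count of 40 are hypothetical values chosen for the example, and the record does not give the actual CAD configuration. The function names are likewise made up for this sketch and are not taken from the authors' repository.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field along one axis of a stack of convolutions.

    Each layer with kernel size k widens the field by (k - 1) times the
    product of the strides of all earlier layers.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf


def depthwise_weights(channels, kernel_sizes):
    """Weight count of stacked depthwise convolutions.

    A depthwise convolution holds one k x k filter per channel
    (bias terms are ignored here).
    """
    return sum(channels * k * k for k in kernel_sizes)


if __name__ == "__main__":
    channels = 40  # hypothetical channel count, not from the record
    for depth in (1, 2, 3):
        kernels = [3] * depth  # assumed 3x3 kernels, stride 1
        print(depth, receptive_field(kernels), depthwise_weights(channels, kernels))
    # A single 7x7 depthwise layer, for comparison with three cascaded 3x3 layers.
    print("7x7", receptive_field([7]), depthwise_weights(channels, [7]))
```

Under these assumptions, three cascaded 3x3 depthwise layers cover the same 7-pixel extent as a single 7x7 depthwise layer while holding 27 weights per channel instead of 49, and they use only plain static convolutions.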