NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation

Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contras...
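
The full abstract in this record describes disentangling rotation from translation by means of two corresponding sets of 3D normal vectors and a closed-form solution. The sketch below illustrates only the underlying geometric idea, not the authors' PNV head: a normal is a direction, so under a rigid pose it transforms as n' = R n and the translation cancels. As a stand-in closed-form solver it uses the classical TRIAD construction on two noise-free normal correspondences. All function names, the test pose, and the sample normals are hypothetical choices made for illustration.

```python
import math

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def rot_x(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def triad_frame(v1, v2):
    """Orthonormal frame (as matrix columns) built from two non-parallel vectors."""
    t1 = normalize(v1)
    t2 = normalize(cross(v1, v2))
    t3 = cross(t1, t2)
    return transpose([t1, t2, t3])

def rotation_from_normals(obj_a, obj_b, cam_a, cam_b):
    """TRIAD: rotation R with cam = R @ obj, from two normal correspondences."""
    return mat_mul(triad_frame(cam_a, cam_b), transpose(triad_frame(obj_a, obj_b)))

# Hypothetical ground-truth pose (assumed values, not taken from the paper).
R_true = mat_mul(rot_z(math.radians(30)), rot_x(math.radians(40)))
t_true = [0.1, -0.2, 0.8]

# Two assumed surface normals in the object frame and their camera-frame images.
obj_a = normalize([1.0, 0.2, 0.0])
obj_b = normalize([0.0, 1.0, 0.5])
cam_a = mat_vec(R_true, obj_a)
cam_b = mat_vec(R_true, obj_b)

def transform(point):
    """Apply the full rigid pose (rotation and translation) to a 3D point."""
    return [r + t for r, t in zip(mat_vec(R_true, point), t_true)]

# Translation cancels for directions: move both ends of a normal, then subtract.
base = [0.3, 0.1, -0.2]
tip = [b + n for b, n in zip(base, obj_a)]
moved_normal = [u - v for u, v in zip(transform(tip), transform(base))]
translation_leak = max(abs(m - c) for m, c in zip(moved_normal, cam_a))

# Recover the rotation from normals alone; t_true is never used by the solver.
R_est = rotation_from_normals(obj_a, obj_b, cam_a, cam_b)
max_err = max(abs(R_est[i][j] - R_true[i][j]) for i in range(3) for j in range(3))

print("translation leak into normals:", translation_leak)
print("max rotation entry error:", max_err)
```

With noisy or dense per-pixel normals, a least-squares solver over all correspondences would be the more natural choice; TRIAD is used here only because it needs nothing beyond cross products.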

Bibliographic details
Published in:	IEEE transactions on circuits and systems for video technology 2024-02, Vol.34 (2), p.1098-1113
Main authors:	Feng, Guangkun; Xu, Ting-Bing; Liu, Fulin; Liu, Mingkun; Zhenzhong, Wei
Format:	Article
Language:	eng
Subjects:	3D normal vector; Accuracy; Artificial neural networks; Cameras; Computer vision; Degeneration; direct regression; Disentangled representation learning; disentanglement; Estimation; Feature extraction; monocular vision; Object detection; Object pose estimation; Pose estimation; Regression; Robotics; Rotation; Solid modeling; Three-dimensional displays; Translations; Vectors
Online access:	Order full text

container_end_page	1113
container_issue	2
container_start_page	1098
container_title	IEEE transactions on circuits and systems for video technology
container_volume	34
creator	Feng, Guangkun; Xu, Ting-Bing; Liu, Fulin; Liu, Mingkun; Zhenzhong, Wei
description	Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.
doi_str_mv	10.1109/TCSVT.2023.3290617
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2923122833</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10168178</ieee_id><sourcerecordid>2923122833</sourcerecordid><originalsourceid>FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</originalsourceid><addsrcrecordid>eNpNkE1PAjEQhhujiYj-AeOhiefFfmy7rTeDiCYEDa5cm-12IIuwxXaJ8d9bhIOnmWSed2byIHRNyYBSou_K4fu8HDDC-IAzTSQtTlCPCqEyxog4TT0RNFOMinN0EeOKEJqrvOihcjqfZVPo7vHUh021xnOoOx_weNc4cHgGywAxNr7FCfr24RMv0vSxidB2VbtcJ0Y-4jcfAY9i12yqLrGX6GxRrSNcHWsffTyNyuFzNnkdvwwfJlnNtOwyLYUFLR0jpOCLmhVWWC4rKxktcltUXNfOyVo5kOA0z3PiaG4Vt0JRW0nH--j2sHcb_NcOYmdWfhfadNIwzThlTHGeKHag6uBjDLAw25AeDT-GErO3Z_7smb09c7SXQjeHUAMA_wJUKloo_gtcEWrx</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2923122833</pqid></control><display><type>article</type><title>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</title><source>IEEE Electronic Library (IEL)</source><creator>Feng, Guangkun ; Xu, Ting-Bing ; Liu, Fulin ; Liu, Mingkun ; Zhenzhong, Wei</creator><creatorcontrib>Feng, Guangkun ; Xu, Ting-Bing ; Liu, Fulin ; Liu, Mingkun ; Zhenzhong, Wei</creatorcontrib><description>Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. 
Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.</description><identifier>ISSN: 1051-8215</identifier><identifier>EISSN: 1558-2205</identifier><identifier>DOI: 10.1109/TCSVT.2023.3290617</identifier><identifier>CODEN: ITCTEM</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>3D normal vector ; Accuracy ; Artificial neural networks ; Cameras ; Computer vision ; Degeneration ; direct regression ; Disentangled representation learning ; disentanglement ; Estimation ; Feature extraction ; monocular vision ; Object detection ; Object pose estimation ; Pose estimation ; Regression ; Robotics ; Rotation ; Solid modeling ; Three-dimensional displays ; Translations ; Vectors</subject><ispartof>IEEE transactions on circuits and systems for video technology, 2024-02, Vol.34 (2), p.1098-1113</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2024</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</citedby><cites>FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</cites><orcidid>0000-0003-1949-5958 ; 0000-0002-8925-9259 ; 0000-0002-2033-2040 ; 0000-0002-0835-5792 ; 0000-0002-5550-7699</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10168178$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27901,27902,54733</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10168178$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Feng, Guangkun</creatorcontrib><creatorcontrib>Xu, Ting-Bing</creatorcontrib><creatorcontrib>Liu, Fulin</creatorcontrib><creatorcontrib>Liu, Mingkun</creatorcontrib><creatorcontrib>Zhenzhong, Wei</creatorcontrib><title>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</title><title>IEEE transactions on circuits and systems for video technology</title><addtitle>TCSVT</addtitle><description>Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. 
Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.</description><subject>3D normal vector</subject><subject>Accuracy</subject><subject>Artificial neural networks</subject><subject>Cameras</subject><subject>Computer vision</subject><subject>Degeneration</subject><subject>direct regression</subject><subject>Disentangled representation learning</subject><subject>disentanglement</subject><subject>Estimation</subject><subject>Feature extraction</subject><subject>monocular vision</subject><subject>Object detection</subject><subject>Object pose estimation</subject><subject>Pose estimation</subject><subject>Regression</subject><subject>Robotics</subject><subject>Rotation</subject><subject>Solid modeling</subject><subject>Three-dimensional 
displays</subject><subject>Translations</subject><subject>Vectors</subject><issn>1051-8215</issn><issn>1558-2205</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpNkE1PAjEQhhujiYj-AeOhiefFfmy7rTeDiCYEDa5cm-12IIuwxXaJ8d9bhIOnmWSed2byIHRNyYBSou_K4fu8HDDC-IAzTSQtTlCPCqEyxog4TT0RNFOMinN0EeOKEJqrvOihcjqfZVPo7vHUh021xnOoOx_weNc4cHgGywAxNr7FCfr24RMv0vSxidB2VbtcJ0Y-4jcfAY9i12yqLrGX6GxRrSNcHWsffTyNyuFzNnkdvwwfJlnNtOwyLYUFLR0jpOCLmhVWWC4rKxktcltUXNfOyVo5kOA0z3PiaG4Vt0JRW0nH--j2sHcb_NcOYmdWfhfadNIwzThlTHGeKHag6uBjDLAw25AeDT-GErO3Z_7smb09c7SXQjeHUAMA_wJUKloo_gtcEWrx</recordid><startdate>20240201</startdate><enddate>20240201</enddate><creator>Feng, Guangkun</creator><creator>Xu, Ting-Bing</creator><creator>Liu, Fulin</creator><creator>Liu, Mingkun</creator><creator>Zhenzhong, Wei</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-1949-5958</orcidid><orcidid>https://orcid.org/0000-0002-8925-9259</orcidid><orcidid>https://orcid.org/0000-0002-2033-2040</orcidid><orcidid>https://orcid.org/0000-0002-0835-5792</orcidid><orcidid>https://orcid.org/0000-0002-5550-7699</orcidid></search><sort><creationdate>20240201</creationdate><title>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</title><author>Feng, Guangkun ; Xu, Ting-Bing ; Liu, Fulin ; Liu, Mingkun ; Zhenzhong, 
Wei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>3D normal vector</topic><topic>Accuracy</topic><topic>Artificial neural networks</topic><topic>Cameras</topic><topic>Computer vision</topic><topic>Degeneration</topic><topic>direct regression</topic><topic>Disentangled representation learning</topic><topic>disentanglement</topic><topic>Estimation</topic><topic>Feature extraction</topic><topic>monocular vision</topic><topic>Object detection</topic><topic>Object pose estimation</topic><topic>Pose estimation</topic><topic>Regression</topic><topic>Robotics</topic><topic>Rotation</topic><topic>Solid modeling</topic><topic>Three-dimensional displays</topic><topic>Translations</topic><topic>Vectors</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Feng, Guangkun</creatorcontrib><creatorcontrib>Xu, Ting-Bing</creatorcontrib><creatorcontrib>Liu, Fulin</creatorcontrib><creatorcontrib>Liu, Mingkun</creatorcontrib><creatorcontrib>Zhenzhong, Wei</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on 
circuits and systems for video technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Feng, Guangkun</au><au>Xu, Ting-Bing</au><au>Liu, Fulin</au><au>Liu, Mingkun</au><au>Zhenzhong, Wei</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</atitle><jtitle>IEEE transactions on circuits and systems for video technology</jtitle><stitle>TCSVT</stitle><date>2024-02-01</date><risdate>2024</risdate><volume>34</volume><issue>2</issue><spage>1098</spage><epage>1113</epage><pages>1098-1113</pages><issn>1051-8215</issn><eissn>1558-2205</eissn><coden>ITCTEM</coden><abstract>Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. 
Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TCSVT.2023.3290617</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0003-1949-5958</orcidid><orcidid>https://orcid.org/0000-0002-8925-9259</orcidid><orcidid>https://orcid.org/0000-0002-2033-2040</orcidid><orcidid>https://orcid.org/0000-0002-0835-5792</orcidid><orcidid>https://orcid.org/0000-0002-5550-7699</orcidid></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1051-8215
ispartof	IEEE transactions on circuits and systems for video technology, 2024-02, Vol.34 (2), p.1098-1113
issn	1051-8215 1558-2205
language	eng
recordid	cdi_proquest_journals_2923122833
source	IEEE Electronic Library (IEL)
subjects	3D normal vector; Accuracy; Artificial neural networks; Cameras; Computer vision; Degeneration; direct regression; Disentangled representation learning; disentanglement; Estimation; Feature extraction; monocular vision; Object detection; Object pose estimation; Pose estimation; Regression; Robotics; Rotation; Solid modeling; Three-dimensional displays; Translations; Vectors
title	NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T03%3A44%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NVR-Net:%20Normal%20Vector%20Guided%20Regression%20Network%20for%20Disentangled%206D%20Pose%20Estimation&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Feng,%20Guangkun&rft.date=2024-02-01&rft.volume=34&rft.issue=2&rft.spage=1098&rft.epage=1113&rft.pages=1098-1113&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2023.3290617&rft_dat=%3Cproquest_RIE%3E2923122833%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2923122833&rft_id=info:pmid/&rft_ieee_id=10168178&rfr_iscdi=true