NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation

Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contras...

Full description

Saved in:
Bibliographic details
Published in: IEEE transactions on circuits and systems for video technology 2024-02, Vol.34 (2), p.1098-1113
Main authors: Feng, Guangkun; Xu, Ting-Bing; Liu, Fulin; Liu, Mingkun; Zhenzhong, Wei
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 1113
container_issue 2
container_start_page 1098
container_title IEEE transactions on circuits and systems for video technology
container_volume 34
creator Feng, Guangkun
Xu, Ting-Bing
Liu, Fulin
Liu, Mingkun
Zhenzhong, Wei
description Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.
doi_str_mv 10.1109/TCSVT.2023.3290617
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2923122833</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10168178</ieee_id><sourcerecordid>2923122833</sourcerecordid><originalsourceid>FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</originalsourceid><addsrcrecordid>eNpNkE1PAjEQhhujiYj-AeOhiefFfmy7rTeDiCYEDa5cm-12IIuwxXaJ8d9bhIOnmWSed2byIHRNyYBSou_K4fu8HDDC-IAzTSQtTlCPCqEyxog4TT0RNFOMinN0EeOKEJqrvOihcjqfZVPo7vHUh021xnOoOx_weNc4cHgGywAxNr7FCfr24RMv0vSxidB2VbtcJ0Y-4jcfAY9i12yqLrGX6GxRrSNcHWsffTyNyuFzNnkdvwwfJlnNtOwyLYUFLR0jpOCLmhVWWC4rKxktcltUXNfOyVo5kOA0z3PiaG4Vt0JRW0nH--j2sHcb_NcOYmdWfhfadNIwzThlTHGeKHag6uBjDLAw25AeDT-GErO3Z_7smb09c7SXQjeHUAMA_wJUKloo_gtcEWrx</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2923122833</pqid></control><display><type>article</type><title>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</title><source>IEEE Electronic Library (IEL)</source><creator>Feng, Guangkun ; Xu, Ting-Bing ; Liu, Fulin ; Liu, Mingkun ; Zhenzhong, Wei</creator><creatorcontrib>Feng, Guangkun ; Xu, Ting-Bing ; Liu, Fulin ; Liu, Mingkun ; Zhenzhong, Wei</creatorcontrib><description>Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. 
Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.</description><identifier>ISSN: 1051-8215</identifier><identifier>EISSN: 1558-2205</identifier><identifier>DOI: 10.1109/TCSVT.2023.3290617</identifier><identifier>CODEN: ITCTEM</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>3D normal vector ; Accuracy ; Artificial neural networks ; Cameras ; Computer vision ; Degeneration ; direct regression ; Disentangled representation learning ; disentanglement ; Estimation ; Feature extraction ; monocular vision ; Object detection ; Object pose estimation ; Pose estimation ; Regression ; Robotics ; Rotation ; Solid modeling ; Three-dimensional displays ; Translations ; Vectors</subject><ispartof>IEEE transactions on circuits and systems for video technology, 2024-02, Vol.34 (2), p.1098-1113</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2024</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</citedby><cites>FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</cites><orcidid>0000-0003-1949-5958 ; 0000-0002-8925-9259 ; 0000-0002-2033-2040 ; 0000-0002-0835-5792 ; 0000-0002-5550-7699</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10168178$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27901,27902,54733</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10168178$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Feng, Guangkun</creatorcontrib><creatorcontrib>Xu, Ting-Bing</creatorcontrib><creatorcontrib>Liu, Fulin</creatorcontrib><creatorcontrib>Liu, Mingkun</creatorcontrib><creatorcontrib>Zhenzhong, Wei</creatorcontrib><title>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</title><title>IEEE transactions on circuits and systems for video technology</title><addtitle>TCSVT</addtitle><description>Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. 
Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.</description><subject>3D normal vector</subject><subject>Accuracy</subject><subject>Artificial neural networks</subject><subject>Cameras</subject><subject>Computer vision</subject><subject>Degeneration</subject><subject>direct regression</subject><subject>Disentangled representation learning</subject><subject>disentanglement</subject><subject>Estimation</subject><subject>Feature extraction</subject><subject>monocular vision</subject><subject>Object detection</subject><subject>Object pose estimation</subject><subject>Pose estimation</subject><subject>Regression</subject><subject>Robotics</subject><subject>Rotation</subject><subject>Solid modeling</subject><subject>Three-dimensional 
displays</subject><subject>Translations</subject><subject>Vectors</subject><issn>1051-8215</issn><issn>1558-2205</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpNkE1PAjEQhhujiYj-AeOhiefFfmy7rTeDiCYEDa5cm-12IIuwxXaJ8d9bhIOnmWSed2byIHRNyYBSou_K4fu8HDDC-IAzTSQtTlCPCqEyxog4TT0RNFOMinN0EeOKEJqrvOihcjqfZVPo7vHUh021xnOoOx_weNc4cHgGywAxNr7FCfr24RMv0vSxidB2VbtcJ0Y-4jcfAY9i12yqLrGX6GxRrSNcHWsffTyNyuFzNnkdvwwfJlnNtOwyLYUFLR0jpOCLmhVWWC4rKxktcltUXNfOyVo5kOA0z3PiaG4Vt0JRW0nH--j2sHcb_NcOYmdWfhfadNIwzThlTHGeKHag6uBjDLAw25AeDT-GErO3Z_7smb09c7SXQjeHUAMA_wJUKloo_gtcEWrx</recordid><startdate>20240201</startdate><enddate>20240201</enddate><creator>Feng, Guangkun</creator><creator>Xu, Ting-Bing</creator><creator>Liu, Fulin</creator><creator>Liu, Mingkun</creator><creator>Zhenzhong, Wei</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-1949-5958</orcidid><orcidid>https://orcid.org/0000-0002-8925-9259</orcidid><orcidid>https://orcid.org/0000-0002-2033-2040</orcidid><orcidid>https://orcid.org/0000-0002-0835-5792</orcidid><orcidid>https://orcid.org/0000-0002-5550-7699</orcidid></search><sort><creationdate>20240201</creationdate><title>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</title><author>Feng, Guangkun ; Xu, Ting-Bing ; Liu, Fulin ; Liu, Mingkun ; Zhenzhong, 
Wei</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c296t-965be96d20073fc27b5b36ab62174b7a39cdd6c8de6ed93440d14b83b581ba6d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>3D normal vector</topic><topic>Accuracy</topic><topic>Artificial neural networks</topic><topic>Cameras</topic><topic>Computer vision</topic><topic>Degeneration</topic><topic>direct regression</topic><topic>Disentangled representation learning</topic><topic>disentanglement</topic><topic>Estimation</topic><topic>Feature extraction</topic><topic>monocular vision</topic><topic>Object detection</topic><topic>Object pose estimation</topic><topic>Pose estimation</topic><topic>Regression</topic><topic>Robotics</topic><topic>Rotation</topic><topic>Solid modeling</topic><topic>Three-dimensional displays</topic><topic>Translations</topic><topic>Vectors</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Feng, Guangkun</creatorcontrib><creatorcontrib>Xu, Ting-Bing</creatorcontrib><creatorcontrib>Liu, Fulin</creatorcontrib><creatorcontrib>Liu, Mingkun</creatorcontrib><creatorcontrib>Zhenzhong, Wei</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on 
circuits and systems for video technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Feng, Guangkun</au><au>Xu, Ting-Bing</au><au>Liu, Fulin</au><au>Liu, Mingkun</au><au>Zhenzhong, Wei</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation</atitle><jtitle>IEEE transactions on circuits and systems for video technology</jtitle><stitle>TCSVT</stitle><date>2024-02-01</date><risdate>2024</risdate><volume>34</volume><issue>2</issue><spage>1098</spage><epage>1113</epage><pages>1098-1113</pages><issn>1051-8215</issn><eissn>1558-2205</eissn><coden>ITCTEM</coden><abstract>Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. 
Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TCSVT.2023.3290617</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0003-1949-5958</orcidid><orcidid>https://orcid.org/0000-0002-8925-9259</orcidid><orcidid>https://orcid.org/0000-0002-2033-2040</orcidid><orcidid>https://orcid.org/0000-0002-0835-5792</orcidid><orcidid>https://orcid.org/0000-0002-5550-7699</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1051-8215
ispartof IEEE transactions on circuits and systems for video technology, 2024-02, Vol.34 (2), p.1098-1113
issn 1051-8215
1558-2205
language eng
recordid cdi_proquest_journals_2923122833
source IEEE Electronic Library (IEL)
subjects 3D normal vector
Accuracy
Artificial neural networks
Cameras
Computer vision
Degeneration
direct regression
Disentangled representation learning
disentanglement
Estimation
Feature extraction
monocular vision
Object detection
Object pose estimation
Pose estimation
Regression
Robotics
Rotation
Solid modeling
Three-dimensional displays
Translations
Vectors
title NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T03%3A44%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NVR-Net:%20Normal%20Vector%20Guided%20Regression%20Network%20for%20Disentangled%206D%20Pose%20Estimation&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Feng,%20Guangkun&rft.date=2024-02-01&rft.volume=34&rft.issue=2&rft.spage=1098&rft.epage=1113&rft.pages=1098-1113&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2023.3290617&rft_dat=%3Cproquest_RIE%3E2923122833%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2923122833&rft_id=info:pmid/&rft_ieee_id=10168178&rfr_iscdi=true
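
The description above rests on one geometric fact: a normal vector is a direction, so a rigid motion changes it through the rotation only, and the translation drops out. Rotation can therefore be recovered from corresponding normals alone. The following is a minimal sketch of that fact, assuming just two non-parallel normals known in the object frame and their counterparts in the camera frame. It uses the classical TRIAD construction, not the PNV head from the article, whose closed-form solution is not given in this record. All function names are hypothetical.

```python
import math


def _normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector (are the two normals parallel?)")
    return [c / n for c in v]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def _triad(v1, v2):
    """Build a right-handed orthonormal basis from two non-parallel vectors."""
    t1 = _normalize(v1)
    t2 = _normalize(_cross(v1, v2))
    t3 = _cross(t1, t2)
    return [t1, t2, t3]  # each entry is one basis vector


def rotation_from_normals(obj_n1, obj_n2, cam_n1, cam_n2):
    """Closed-form rotation R with cam_n = R @ obj_n (TRIAD construction).

    Assumption for this sketch: the inputs are noise-free and the two
    normals in each frame are not parallel.
    """
    a = _triad(obj_n1, obj_n2)  # basis in the object frame
    b = _triad(cam_n1, cam_n2)  # the same basis seen in the camera frame
    # R = sum_k b_k a_k^T maps each object-frame basis vector onto its
    # camera-frame counterpart.
    return [
        [sum(b[k][i] * a[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def rotate(R, v):
    """Apply a 3x3 rotation matrix (list of rows) to a 3D vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    # Object-frame normals and the same normals after a 90 degree turn about z.
    R = rotation_from_normals([1, 0, 0], [0, 1, 0], [0, 1, 0], [-1, 0, 0])
    for row in R:
        print(row)
```

No translation appears anywhere in the computation, which is the disentanglement the abstract refers to. With many noisy correspondences, as in a dense pixelwise prediction, a least-squares solver over all pairs would be the more natural choice than this two-pair construction.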