NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation
Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contras...
Published in: | IEEE transactions on circuits and systems for video technology 2024-02, Vol.34 (2), p.1098-1113 |
---|---|
Main authors: | Feng, Guangkun; Xu, Ting-Bing; Liu, Fulin; Liu, Mingkun; Zhenzhong, Wei |
Format: | Article |
Language: | eng |
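Subjects: | 3D normal vector; Accuracy; Artificial neural networks; Cameras; Computer vision; Degeneration; direct regression; Disentangled representation learning; disentanglement; Estimation; Feature extraction; monocular vision; Object detection; Object pose estimation; Pose estimation; Regression; Robotics; Rotation; Solid modeling; Three-dimensional displays; Translations; Vectors |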
container_end_page | 1113 |
---|---|
container_issue | 2 |
container_start_page | 1098 |
container_title | IEEE transactions on circuits and systems for video technology |
container_volume | 34 |
creator | Feng, Guangkun; Xu, Ting-Bing; Liu, Fulin; Liu, Mingkun; Zhenzhong, Wei |
description | Monocular 6D pose estimation for objects is an essential but challenging task that is commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in accuracy degeneration. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly but suffer from performance gaps in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) to directly regress the 6D pose from a single RGB image under the guidance of 3D normal vectors. Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation. It consists of two corresponding sets of 3D normal vectors to thoroughly disentangle rotation from translation estimation. Then, we introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head networks under the instruction of a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods on rotation accuracy and removes the gap between indirect and end-to-end methods. Moreover, our method can estimate the 6D pose of a single object within an RGB image in real-time. |
doi_str_mv | 10.1109/TCSVT.2023.3290617 |
format | Article |
fullrecord | Publisher: New York: IEEE. CODEN: ITCTEM. IEEE document ID: 10168178. ProQuest ID: 2923122833. Total pages: 16. Peer reviewed. Rights: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024. Author ORCID iDs: 0000-0003-1949-5958; 0000-0002-8925-9259; 0000-0002-2033-2040; 0000-0002-0835-5792; 0000-0002-5550-7699. Record at IEEE: https://ieeexplore.ieee.org/document/10168178 |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1051-8215 |
ispartof | IEEE transactions on circuits and systems for video technology, 2024-02, Vol.34 (2), p.1098-1113 |
issn | 1051-8215; 1558-2205 |
language | eng |
recordid | cdi_proquest_journals_2923122833 |
source | IEEE Electronic Library (IEL) |
subjects | 3D normal vector; Accuracy; Artificial neural networks; Cameras; Computer vision; Degeneration; direct regression; Disentangled representation learning; disentanglement; Estimation; Feature extraction; monocular vision; Object detection; Object pose estimation; Pose estimation; Regression; Robotics; Rotation; Solid modeling; Three-dimensional displays; Translations; Vectors |
title | NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T03%3A44%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=NVR-Net:%20Normal%20Vector%20Guided%20Regression%20Network%20for%20Disentangled%206D%20Pose%20Estimation&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Feng,%20Guangkun&rft.date=2024-02-01&rft.volume=34&rft.issue=2&rft.spage=1098&rft.epage=1113&rft.pages=1098-1113&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2023.3290617&rft_dat=%3Cproquest_RIE%3E2923122833%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2923122833&rft_id=info:pmid/&rft_ieee_id=10168178&rfr_iscdi=true |
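
The description above states that two corresponding sets of 3D normal vectors suffice to estimate rotation separately from translation, via a differentiable closed-form solution. The record does not give that solution, so the sketch below substitutes a classical stand-in, the TRIAD construction, which recovers a rotation in closed form from two non-parallel direction vectors known in both the object frame and the camera frame. It is not the paper's PNV head. All function names and the sample vectors are hypothetical and chosen only for illustration. Direction vectors are unaffected by translation, which is why no translation term appears anywhere in the computation.

```python
import math

def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def normalize(u):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in u))
    return tuple(c / n for c in u)

def triad_basis(v1, v2):
    """Right-handed orthonormal basis built from two non-parallel vectors."""
    e1 = normalize(v1)
    e2 = normalize(cross(v1, v2))
    e3 = cross(e1, e2)
    return e1, e2, e3

def rotation_from_normals(obj_n1, obj_n2, cam_n1, cam_n2):
    """Closed-form R (3x3 nested lists) such that cam_n = R @ obj_n.

    Assumption: TRIAD stands in for the paper's unspecified solution.
    """
    a = triad_basis(obj_n1, obj_n2)  # basis expressed in the object frame
    b = triad_basis(cam_n1, cam_n2)  # same basis expressed in the camera frame
    # R = sum_k b_k a_k^T maps each object-frame basis vector to its camera-frame twin.
    return [[sum(b[k][i] * a[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

def matmul(p, q):
    """Multiply two 3x3 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rot_z(t):
    """Rotation by angle t (radians) about the z axis."""
    return [[math.cos(t), -math.sin(t), 0.0],
            [math.sin(t), math.cos(t), 0.0],
            [0.0, 0.0, 1.0]]

def rot_x(t):
    """Rotation by angle t (radians) about the x axis."""
    return [[1.0, 0.0, 0.0],
            [0.0, math.cos(t), -math.sin(t)],
            [0.0, math.sin(t), math.cos(t)]]

# Hypothetical ground-truth rotation and two object-frame normals.
R_true = matmul(rot_z(0.5), rot_x(-0.3))
obj_n1 = (1.0, 0.0, 0.0)
obj_n2 = (0.0, 0.6, 0.8)

# Camera-frame normals: only the rotation acts on directions.
cam_n1 = matvec(R_true, obj_n1)
cam_n2 = matvec(R_true, obj_n2)

R_est = rotation_from_normals(obj_n1, obj_n2, cam_n1, cam_n2)
max_err = max(abs(R_est[i][j] - R_true[i][j])
              for i in range(3) for j in range(3))
print("max abs error:", max_err)
```

With noise-free inputs the recovered matrix should match the ground truth up to floating-point rounding. A dense pixelwise prediction, as the description mentions, would supply many noisy correspondences rather than two exact ones, and a least-squares solver over all of them would be the more natural choice in that setting.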