NVR-Net: Normal Vector Guided Regression Network for Disentangled 6D Pose Estimation
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-02, Vol. 34 (2), pp. 1098-1113
Format: Article
Language: English
Abstract: Monocular 6D pose estimation for objects is an essential but challenging task commonly applied in computer vision and robotics. Existing two-stage methods solve for rotations with Perspective-n-Point (PnP), which still incorporates translations, resulting in degraded accuracy. In contrast, direct regression methods adopt Convolutional Neural Networks (CNNs) to solve for rotations and translations jointly, but suffer from a performance gap in rotation accuracy. In this article, we propose a novel Normal Vector guided Regression Network (NVR-Net) that directly regresses the 6D pose from a single RGB image under the guidance of 3D normal vectors. Specifically, we design a novel Orientation-Aware Feature (OAF) for pose estimation: it consists of two corresponding sets of 3D normal vectors that thoroughly disentangle rotation estimation from translation estimation. We then introduce a CNN to predict a dense pixelwise representation of the OAF without viewpoint ambiguity. To estimate rotations and translations individually from the OAF, we propose a novel Pose from Normal Vectors (PNV) head network guided by a differentiable closed-form solution. Finally, extensive experiments on three common benchmarks demonstrate that our approach outperforms state-of-the-art methods in rotation accuracy and closes the gap between indirect and end-to-end methods. Moreover, our method estimates the 6D pose of a single object in an RGB image in real time.
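The abstract mentions recovering rotation from two corresponding sets of 3D normal vectors via a differentiable closed-form solution. The paper's actual PNV head is not reproduced here, but the classical closed-form answer to this subproblem is the orthogonal Procrustes / Kabsch solution via SVD; the sketch below illustrates that idea. The function name and array conventions are illustrative assumptions, not the authors' API.

```python
import numpy as np

def rotation_from_normals(n_obj, n_cam):
    """Closed-form best-fit rotation between corresponding unit normal
    vector sets (Kabsch / orthogonal Procrustes, a standard technique
    the paper's differentiable solver resembles; this is a sketch).

    n_obj, n_cam: (N, 3) arrays of corresponding unit normals.
    Returns R (3x3) such that n_cam[i] ~= R @ n_obj[i].
    """
    H = n_cam.T @ n_obj                  # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    return U @ D @ Vt
```

Because the solution is composed of differentiable matrix operations (matmul and SVD), gradients can flow through it during training, which is what makes a "differentiable closed-form solution" usable inside an end-to-end network.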
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3290617