HFT6D: Multimodal 6D object pose estimation based on hierarchical feature transformer
Published in: | Measurement : journal of the International Measurement Confederation 2024-01, Vol.224, p.113848, Article 113848 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Visual information is usually multimodal, including texture, color (2D information), and space (3D information). However, there are two problems in establishing multimodal 6D object pose estimation: (1) substantial differences between RGB images and depth data; (2) systematic noise in the depth images and a lack of contextual information in the association process. To solve these problems, this paper proposes an end-to-end hierarchical feature transformer (HFT6D) containing four independent crossmodal transformer stages. The novel hierarchical feature architecture suppresses the effect of noise by modeling the spatial correspondence between the two modalities. The core module of HFT6D is the bi-directional crossmodal attention, which aligns the appearance and geometric representations by recalibrating RGB-D data. In addition, the proposed HFT6D runs in real time and is robust to occluded scenes. Comprehensive experiments on two benchmark datasets show that HFT6D achieves state-of-the-art performance in terms of accuracy and speed.
• This paper proposes an end-to-end hierarchical feature transformer (HFT6D) to address the substantial differences between RGB and depth data and the systematic noise in the depth data.
• The novel hierarchical feature architecture suppresses the effect of noise by correlating the spatial correspondence of high/low-resolution feature mappings.
• The core module of HFT6D is the bi-directional crossmodal attention, which learns the degree of similarity between RGB and depth data to align the two modalities.
• The proposed HFT6D achieves state-of-the-art performance on two benchmarks, YCB-Video and LineMOD. |
---|---|
ISSN: | 0263-2241 1873-412X |
DOI: | 10.1016/j.measurement.2023.113848 |
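
The abstract names bi-directional crossmodal attention as the core module but gives no implementation detail. The sketch below only illustrates the general idea of two feature sets attending to each other with scaled dot-product attention. It is not the authors' module: the function names (`cross_attend`, `bidirectional_cross_attention`), the residual update, and the absence of learned projections, multiple heads, and the four hierarchical stages are all assumptions made to keep the example small.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over all context tokens.

    Assumption: queries, keys and values are the raw features
    (no learned projection matrices), unlike a trained transformer.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product similarity between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Weighted sum of context tokens.
        attended.append(
            [sum(w * k[j] for w, k in zip(weights, context)) for j in range(dim)]
        )
    return attended


def bidirectional_cross_attention(rgb_tokens, geo_tokens):
    """Hypothetical two-way exchange: RGB attends to geometry and vice versa.

    Both directions read the original inputs, and each result is added
    back to its own modality as a residual (an assumption of this sketch).
    """
    rgb_update = cross_attend(rgb_tokens, geo_tokens)
    geo_update = cross_attend(geo_tokens, rgb_tokens)
    rgb_out = [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(rgb_tokens, rgb_update)]
    geo_out = [[x + u for x, u in zip(tok, upd)] for tok, upd in zip(geo_tokens, geo_update)]
    return rgb_out, geo_out


# Toy inputs: three appearance tokens and two geometry tokens, feature size 2.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
geo = [[0.5, 0.5], [2.0, -1.0]]
rgb_aligned, geo_aligned = bidirectional_cross_attention(rgb, geo)
```

Each modality keeps its own token count after the exchange, so the two outputs can be passed to a later stage or fused. A real model would insert learned query, key and value projections before the dot products.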