Latent Distribution-Based 3D Hand Pose Estimation From Monocular RGB Images


Bibliographic details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2021-12, Vol. 31 (12), p. 4883-4894
Main authors: Li, Moran; Wang, Jialong; Sang, Nong
Format: Article
Language: English
Description
Abstract: In this article, we propose a novel compressed latent distribution representation for 3D hand pose estimation from monocular RGB images to alleviate the channel correspondence problem. The channel correspondence problem occurs when the 2D and depth coordinates are estimated from independent feature maps, which means the 2D and depth channel sequences may not match during cross-dataset inference. In contrast, we propose a compressed latent distribution representation in which the 2D and depth feature maps for each joint are interconnected and inter-constrained more directly, effectively alleviating the channel correspondence problem and improving cross-dataset performance. Moreover, we design an efficient encoder-decoder network that maintains the resolution of feature maps to enable better hand feature extraction from monocular RGB images. The overall pipeline contains two branches: a 2D hand pose estimation branch based on a latent heatmap representation (LHR), and a 3D hand pose estimation branch based on our proposed latent distribution representation (LDR). The 2D estimation branch serves as guidance for the 3D branch, which simplifies the optimization of the overall network and results in faster convergence during training. Results on several benchmark datasets (including STB, RHD, and the recently released InterHand2.6M) demonstrate that our proposed method achieves state-of-the-art (SOTA) performance.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2021.3055862