Multiscale feature fusion network for monocular complex hand pose estimation

Bibliographic details
Published in:	Electronics letters 2023-12, Vol.59 (24), p.n/a
Main authors:	Zhan, Zhi; Luo, Guang
Format:	Article
Language:	English
Subjects:	Complexity; Datasets; Feature extraction; Feature maps; Fingers; Image enhancement; learning (artificial intelligence); Methods; Monocular vision; Occlusion; Optimization; Pose estimation; Semantics
Online access:	Full text

Description
Abstract:	Hand pose estimation from a single RGB image suffers from low accuracy because of pose complexity, local self‐similarity of finger features, and occlusion. To address this problem, a multiscale feature fusion network (MS‐FF) for monocular hand pose estimation is proposed. The network exploits information across channels to enhance important gesture information, and it simultaneously extracts features from feature maps of different resolutions to capture as much fine detail and deep semantic information as possible. The feature maps are then merged to obtain the hand pose result. The MS‐FF is trained on the InterHand2.6M dataset and the Rendered Handpose Dataset (RHD). Compared with the other methods, the MS‐FF obtains the smallest average hand joint error, which verifies its effectiveness. The authors propose an MS‐FF for monocular hand pose estimation. To process the detail of occluded edges and fingertips effectively, the network extracts information at different levels from feature maps of different resolutions, which allows it to estimate hand poses more accurately. A channel conversion module adjusts the channel weights. A global regression module fuses feature maps of different resolutions so that both the edge detail of the images and the deep semantic information are used. An optimization procedure corrects some joints that were not placed at the correct position. The proposed method achieves higher accuracy and robustness, and experiments verify the effectiveness of the MS‐FF.
ISSN:	0013-5194, 1350-911X
DOI:	10.1049/ell2.13044
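
The abstract names two mechanisms: a channel conversion module that adjusts channel weights, and a global regression module that fuses feature maps of different resolutions. The record does not say how either is implemented, so the following is only a toy sketch of the general idea and not the authors' method. It assumes a sigmoid gate on each channel's global average for the reweighting, nearest-neighbour upsampling, and element-wise addition for the fusion. The function names `channel_reweight`, `upsample_nearest`, and `fuse` are hypothetical.

```python
import math


def channel_reweight(feature_map):
    """Scale each channel by a gate computed from its own global average.

    feature_map is a list of channels, each a 2D list of floats.
    The sigmoid gate is an assumption, not taken from the paper.
    """
    reweighted = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)        # global average pooling
        gate = 1.0 / (1.0 + math.exp(-mean))    # channel weight in (0, 1)
        reweighted.append([[v * gate for v in row] for row in channel])
    return reweighted


def upsample_nearest(channel, factor):
    """Nearest-neighbour upsampling of one 2D channel by an integer factor."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def fuse(high_res, low_res):
    """Upsample low_res to the size of high_res and add the maps channel-wise.

    Both inputs must have the same number of channels, and the high-resolution
    width must be an integer multiple of the low-resolution width.
    Addition is an assumed fusion operator.
    """
    factor = len(high_res[0]) // len(low_res[0])
    fused = []
    for hi, lo in zip(high_res, low_res):
        lo_up = upsample_nearest(lo, factor)
        fused.append([[a + b for a, b in zip(r1, r2)]
                      for r1, r2 in zip(hi, lo_up)])
    return fused


# One-channel example: a 2x2 detail map fused with a 1x1 semantic map.
detail = [[[1.0, 2.0], [3.0, 4.0]]]
semantic = [[[10.0]]]
fused_map = fuse(channel_reweight(detail), channel_reweight(semantic))
```

In this sketch the high-resolution map carries the edge and fingertip detail while the low-resolution map stands in for the deeper semantic features. A real network would learn the channel weights and the fusion rather than use these fixed rules.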