TSwinPose: Enhanced monocular 3D human pose estimation with JointFlow
Published in: | Expert Systems with Applications, 2024-09, Vol. 249, p. 123545, Article 123545 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | English |
Abstract: | Monocular 3D human pose estimation is challenging because of depth ambiguity and partial occlusion. Most recent works formulate it as a 2D-to-3D lifting task that takes 2D keypoint sequences as input and exploits spatial and temporal relationships. However, prior works focus on capturing spatio-temporal correlations while ignoring joint motion, which is needed for continuous estimation. To extend the potential of 2D-to-3D pose estimation, we propose TSwinPose, which learns multi-scale spatio-temporal representations from both 2D keypoint locations and motion patterns. The input 2D keypoint sequences are enhanced by JointFlow, which encodes the motion of each human joint. Building on Swin Transformer, we design a temporal-domain Swin-Unet structure that models multi-scale spatio-temporal relationships of human joints across different temporal windows. The final 3D pose, generated from multi-stage representations, is temporally consistent and more accurate. Experiments on three benchmark datasets, Human3.6M, MPI-INF-3DHP, and HumanEva-I, demonstrate that TSwinPose achieves performance on par with state-of-the-art methods. Moreover, adding JointFlow as a plug-in extension significantly improves performance, particularly for long-term 2D-to-3D lifting human pose estimation methods.
• TSwinPose, a novel multi-scale spatio-temporal monocular 3D pose estimation method.
• JointFlow, a plug-in extension that boosts 2D-to-3D lifting human pose estimation.
• A thorough analysis shows SOTA results and confirms the performance boost from JointFlow. |
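The abstract does not say exactly how JointFlow is computed or how the temporal Swin-Unet partitions a sequence. The following NumPy sketch is only an illustration under stated assumptions: it treats JointFlow as the frame-to-frame 2D displacement of each joint, concatenated with the keypoint coordinates, and it mimics Swin-style (shifted) window partitioning and frame merging along the time axis to picture the multi-scale temporal structure. All function names (`joint_flow`, `enhance_input`, `temporal_windows`, `temporal_merge`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def joint_flow(keypoints: np.ndarray) -> np.ndarray:
    """Hypothetical motion encoding: per-joint 2D displacement between frames.

    keypoints: array of shape (T, J, 2) with 2D joint positions over T frames.
    Returns an array of the same shape; frame 0 is assigned zero motion.
    """
    flow = np.zeros_like(keypoints)
    flow[1:] = keypoints[1:] - keypoints[:-1]
    return flow


def enhance_input(keypoints: np.ndarray) -> np.ndarray:
    """Concatenate positions and assumed motion features: (T, J, 2) -> (T, J, 4)."""
    return np.concatenate([keypoints, joint_flow(keypoints)], axis=-1)


def temporal_windows(x: np.ndarray, window: int, shift: int = 0) -> np.ndarray:
    """Split a (T, ...) sequence into non-overlapping windows of `window` frames.

    If shift > 0, the sequence is cyclically rolled first, imitating the
    shifted-window idea of Swin Transformer applied along the time axis.
    Returns shape (T // window, window, ...).
    """
    T = x.shape[0]
    if T % window != 0:
        raise ValueError("sequence length must be divisible by the window size")
    if shift:
        x = np.roll(x, -shift, axis=0)
    return x.reshape(T // window, window, *x.shape[1:])


def temporal_merge(x: np.ndarray) -> np.ndarray:
    """Halve temporal resolution by stacking adjacent frames: (T, J, C) -> (T/2, J, 2C).

    A rough analogue of Swin patch merging, used here for the downsampling
    path of a U-Net-like multi-scale encoder.
    """
    if x.shape[0] % 2 != 0:
        raise ValueError("sequence length must be even to merge frame pairs")
    return np.concatenate([x[0::2], x[1::2]], axis=-1)


if __name__ == "__main__":
    T, J = 8, 17  # 8 frames, 17 joints (a common Human3.6M skeleton size)
    kp = np.random.default_rng(0).normal(size=(T, J, 2))
    feats = enhance_input(kp)
    print(feats.shape)                                         # (8, 17, 4)
    print(temporal_windows(feats, window=4, shift=2).shape)    # (2, 4, 17, 4)
    print(temporal_merge(feats).shape)                         # (4, 17, 8)
```

Zero-filling the first frame keeps the motion features aligned with the input length, so they can be concatenated to the keypoints without changing the sequence length; the paper's actual encoding may differ.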
---|---|
ISSN: | 0957-4174 1873-6793 |
DOI: | 10.1016/j.eswa.2024.123545 |