Joint-Motion Mutual Learning for Pose Estimation in Videos
Format: Article
Language: English
Abstract: Human pose estimation in videos has long been a compelling yet challenging task in computer vision. It remains difficult because of degraded video conditions such as defocus and self-occlusion. Recent methods strive to integrate multi-frame visual features generated by a backbone network for pose estimation. However, they often ignore the useful joint information encoded in the initial heatmap, a by-product of the backbone. Conversely, methods that attempt to refine the initial heatmap fail to consider any spatio-temporal motion features. As a result, existing pose estimation methods fall short because they cannot leverage both local joint (heatmap) information and global motion (feature) dynamics.
To address this problem, we propose a novel joint-motion mutual learning framework for pose estimation that concentrates on both local joint dependency and global pixel-level motion dynamics. Specifically, we introduce a context-aware joint learner that adaptively leverages initial heatmaps and motion flow to retrieve robust local joint features.
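The record gives no implementation details for the joint learner; the following is a minimal PyTorch sketch of one plausible design, assuming separate heatmap and flow encoders combined by a learned spatial gate (the `ContextAwareJointLearner` name and all module choices here are hypothetical, not taken from the paper).

```python
import torch
import torch.nn as nn

class ContextAwareJointLearner(nn.Module):
    """Illustrative sketch: fuse initial heatmaps with motion flow
    into a local joint feature map (hypothetical design)."""

    def __init__(self, num_joints: int, feat_dim: int = 64):
        super().__init__()
        # Encode per-joint heatmaps into a shared feature space.
        self.heatmap_enc = nn.Conv2d(num_joints, feat_dim, 3, padding=1)
        # Encode 2-channel motion flow (dx, dy) into the same space.
        self.flow_enc = nn.Conv2d(2, feat_dim, 3, padding=1)
        # Predict a spatial gate from both cues so the heatmap
        # contribution adapts to the surrounding motion context.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * feat_dim, feat_dim, 1), nn.Sigmoid()
        )

    def forward(self, heatmaps: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        h = self.heatmap_enc(heatmaps)           # (B, C, H, W)
        f = self.flow_enc(flow)                  # (B, C, H, W)
        g = self.gate(torch.cat([h, f], dim=1))  # adaptive weighting in [0, 1]
        return g * h + (1.0 - g) * f             # context-aware joint feature
```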
Given that local joint features and global motion flow are complementary, we further propose a progressive joint-motion mutual learning scheme that synergistically exchanges information between joint features and motion flow, letting each refine the other and improving the capability of the model.
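The abstract does not specify how the two streams exchange information; a common way to realize such bidirectional, progressive exchange is stacked cross-attention, sketched below under that assumption (the `MutualLearningBlock` name and design are illustrative, not the authors' code).

```python
import torch
import torch.nn as nn

class MutualLearningBlock(nn.Module):
    """Sketch of one mutual-learning stage: each stream queries the
    other via cross-attention, so joint and motion information flows
    in both directions (a plausible reading of the abstract)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.joint_from_motion = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_from_joint = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_j = nn.LayerNorm(dim)
        self.norm_m = nn.LayerNorm(dim)

    def forward(self, joint: torch.Tensor, motion: torch.Tensor):
        # joint, motion: (B, N, dim) token sequences (flattened spatial grid).
        j_upd, _ = self.joint_from_motion(joint, motion, motion)
        m_upd, _ = self.motion_from_joint(motion, joint, joint)
        return self.norm_j(joint + j_upd), self.norm_m(motion + m_upd)

# "Progressive" exchange: stack several stages so the two streams
# refine each other step by step.
stages = nn.ModuleList([MutualLearningBlock(64) for _ in range(3)])
joint = torch.randn(2, 48 * 64, 64)
motion = torch.randn(2, 48 * 64, 64)
for stage in stages:
    joint, motion = stage(joint, motion)
```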
More importantly, to capture more diverse joint and motion cues, we theoretically analyze and propose an information orthogonality objective that avoids learning redundant information across the multiple cues. Experiments show that our method outperforms prior art on three challenging benchmarks.
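The paper's exact orthogonality objective is not reproduced in this record; as a rough illustration, one standard way to discourage redundancy between two feature streams is to penalize their cosine similarity, as in the hypothetical `orthogonality_loss` below.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(joint_feat: torch.Tensor, motion_feat: torch.Tensor) -> torch.Tensor:
    """Penalize squared cosine similarity between the two streams so
    they carry complementary rather than redundant information.
    (A common decorrelation loss; the paper's formulation may differ.)"""
    j = F.normalize(joint_feat.flatten(1), dim=1)  # (B, D), unit norm
    m = F.normalize(motion_feat.flatten(1), dim=1)
    cos = (j * m).sum(dim=1)                       # per-sample cosine in [-1, 1]
    return (cos ** 2).mean()                       # zero when streams are orthogonal
```

Driving the cosine toward zero rather than minimizing it directly avoids pushing the features to be anti-correlated, which would still be redundant.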
DOI: 10.48550/arxiv.2408.02285