MOTPose: Multi-object 6D Pose Estimation for Dynamic Video Sequences using Attention-based Temporal Fusion
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Abstract: | Cluttered bin-picking environments are challenging for pose estimation
models. Despite the impressive progress enabled by deep learning, single-view
RGB pose estimation models perform poorly in cluttered dynamic environments.
Incorporating the rich temporal information contained in videos of the scenes
has the potential to enhance models' ability to deal with the adverse effects of
occlusion and the dynamic nature of the environments. Moreover, joint object
detection and pose estimation models are better suited to leverage the
co-dependent nature of the two tasks, improving the accuracy of both. To
this end, we propose attention-based temporal fusion for multi-object 6D pose
estimation that accumulates information across multiple frames of a video
sequence. Our MOTPose method takes a sequence of images as input and performs
joint object detection and pose estimation for all objects in one forward pass.
It learns to aggregate both object embeddings and object parameters over
multiple time steps using cross-attention-based fusion modules. We evaluate our
method on the physically realistic cluttered bin-picking dataset SynPick and
the YCB-Video dataset and demonstrate improved pose estimation accuracy as well
as better object detection accuracy. |
---|---|
DOI: | 10.48550/arxiv.2403.09309 |
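
The abstract describes fusing per-object embeddings across time steps with cross-attention. The following is a minimal sketch of that general idea only, not the MOTPose architecture: it assumes single-head scaled dot-product attention with identity projections (no learned query/key/value weights) and a simple residual connection, and the names `cross_attention_fuse`, `current`, and `history` are hypothetical. It uses only the Python standard library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(current, history):
    """Fuse current-frame object embeddings with embeddings from past frames.

    current: list of d-dimensional embeddings used as queries.
    history: list of d-dimensional embeddings used as keys and values.
    Assumption: identity projections stand in for learned Q/K/V matrices.
    Returns (fused, weights), where fused[i] is current[i] plus the
    attention-weighted sum of the history embeddings.
    """
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)
    fused, all_weights = [], []
    for query in current:
        # Scaled dot-product similarity between this query and every past embedding.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in history]
        weights = softmax(scores)
        # Weighted sum of the past embeddings (the values).
        context = [
            sum(w * value[j] for w, value in zip(weights, history))
            for j in range(dim)
        ]
        # Residual connection keeps the current-frame information.
        fused.append([q + c for q, c in zip(query, context)])
        all_weights.append(weights)
    return fused, all_weights


# Two current-frame object embeddings attend over three past-frame embeddings.
current = [[1.0, 0.0], [0.0, 1.0]]
history = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused, weights = cross_attention_fuse(current, history)
```

In this toy setup each query puts most of its attention on the past embedding that points in the same direction, so an object that is occluded in the current frame could in principle still draw on what was observed earlier. A real model would add learned projections, multiple heads, and normalization layers.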