Similarity- and Quality-Guided Relation Learning for Joint Detection and Tracking

Bibliographic details
Published in:	IEEE Transactions on Multimedia, 2024-01, Vol. 26, pp. 1-13
Main authors:	Feng, Weitao; Bai, Lei; Yao, Yongqiang; Gan, Weihao; Wu, Wei; Ouyang, Wanli
Format:	Article
Language:	English
Subjects:	Affinity; Computer vision; Correlation; Feature extraction; Instance-level Spatial-temporal Aggregation; Joint Detection and Tracking; Learning; Modelling; Modules; Multi-Object Tracking; Multiple target tracking; Object detection; Relation Learning; Representations; Semantics; Similarity; Similarity- and Quality-Guided Attention; Target tracking; Task analysis; Videos

Description
Summary:	Joint detection and tracking, which solves two fundamental vision challenges in a unified manner, is a challenging topic in computer vision. In this area, the proper use of spatial-temporal information in videos can help reduce local defects and improve the quality of feature representations. Although modeling low-level (usually pixel-wise) spatial-temporal information has been studied, instance-level spatial-temporal correlations (i.e., relations between semantic regions in which instances have occurred) have not been fully exploited. In comparison, modeling instance-level correlation is a more flexible and reasonable way to enhance feature representations. However, we have found that conventional instance-level relation learning that works for the separate tasks of detection or tracking is not effective in joint tasks in which a variety of scenarios may be presented. To try to resolve this problem, in this study, we effectively exploited instance-level spatial-temporal semantic information for joint detection and tracking via a joint relation learning pipeline with a novel relation learning mechanism called Similarity- and Quality-Guided Attention (SQGA). Specifically, we added task-specific SQGA relation modules before the corresponding task prediction heads to refine the instance feature representation using features of other reference instances in the neighboring frames; these features are aggregated on the basis of relational affinities. In particular, in SQGA, relational affinities were factorized to similarity and quality terms so that fine-grained supervision rules could be applied. Then we added task-specific attention losses for each SQGA relation module, resulting in a better feature aggregation for the corresponding task. Quantitative experiments based on several challenging multi-object tracking benchmarks showed that our approach was more effective than the baselines and provided competitive results compared with recent state-of-the-art methods.
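
Illustrative note: the summary describes refining an instance feature by aggregating reference-instance features from neighboring frames, weighted by affinities that factorize into a similarity term and a quality term. The summary gives no formulas, so the following is only a minimal sketch of that general idea under stated assumptions, not the authors' implementation. The function names, the cosine-based similarity, the temperature, the multiplicative combination of the two terms, and the residual update are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sqga_refine(query, refs, qualities, temperature=0.1):
    """Refine one instance feature using reference instance features.

    Assumed (hypothetical) factorization, not taken from the paper:
        affinity_i  proportional to  exp(cos(query, ref_i) / T) * quality_i
    where quality_i in [0, 1] scores how reliable reference i is.
    """
    # Similarity term: how alike the query and each reference are.
    sims = [math.exp(cosine(query, r) / temperature) for r in refs]
    # Combine with the quality term, then normalize to get affinities.
    raw = [s * q for s, q in zip(sims, qualities)]
    total = sum(raw)
    if total <= 0.0:
        # No usable reference: leave the feature unchanged.
        return list(query), [0.0] * len(refs)
    weights = [w / total for w in raw]
    # Affinity-weighted aggregation of the reference features.
    aggregated = [
        sum(w * r[d] for w, r in zip(weights, refs))
        for d in range(len(query))
    ]
    # Residual update (an assumption): original feature plus aggregate.
    refined = [x + a for x, a in zip(query, aggregated)]
    return refined, weights


# Toy usage: one query instance and two references from a neighboring frame.
query = [1.0, 0.0]
refs = [[1.0, 0.0], [0.0, 1.0]]
refined, weights = sqga_refine(query, refs, qualities=[0.9, 0.9])
```

Keeping the two factors separate is what would allow each to receive its own supervision signal, as the summary indicates; in this sketch a similar but low-quality reference (for example, an occluded instance) contributes little to the aggregate.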
ISSN:	1520-9210 1941-0077
DOI:	10.1109/TMM.2023.3279670