Thinking Inside Uncertainty: Interest Moment Perception for Diverse Temporal Grounding

Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2022-10, Vol. 32 (10), pp. 7190-7203
Authors: Zhou, Hao; Zhang, Chongyang; Luo, Yan; Hu, Chuanping; Zhang, Wenjun
Format: Article
Language: English
Description
Abstract: Given a language query, the temporal grounding task is to localize the temporal boundaries of the described event in an untrimmed video. A long-standing challenge is that multiple moments may be associated with the same video-query pair, termed label uncertainty. However, existing methods struggle to localize diverse moments due to the lack of multi-label annotations. In this paper, we propose a novel Diverse Temporal Grounding framework (DTG) to achieve diverse moment localization with only single-label annotations. By delving into the label uncertainty, we find that the diverse moments to be retrieved tend to involve similar actions/objects, which motivates us to perceive these interest moments. Specifically, we construct soft multi-labels through the semantic similarity of multiple video-query pairs. These soft labels reveal whether multiple moments within the same video contain similar verbs/nouns, thereby guiding interest moment generation. Meanwhile, we put forward a diverse moment regression network (DMRNet) to achieve multiple predictions in a single pass, where plausible moments are dynamically picked out from the interest moments for joint optimization. Moreover, we introduce new metrics that better reveal multi-output performance. Extensive experiments conducted on Charades-STA and ActivityNet Captions show that our method achieves state-of-the-art performance in terms of both standard and new metrics.
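The abstract only sketches the mechanism, so the minimal Python sketch below illustrates the two ideas it names under assumptions of ours, not the authors' implementation: (1) building soft multi-labels from verb/noun overlap between queries of the same video, and (2) dynamically picking out plausible predictions, by temporal IoU against interest moments, for joint optimization. The function names, the Jaccard-style similarity, and the IoU threshold are all illustrative choices.

```python
# Hypothetical sketch of the two core ideas described in the abstract.
# Soft-label construction and plausible-moment selection are simplified;
# thresholds and similarity measures are illustrative assumptions.

from typing import List, Set, Tuple


def soft_label(query_a: Tuple[Set[str], Set[str]],
               query_b: Tuple[Set[str], Set[str]]) -> float:
    """Score how likely two intra-video queries describe similar moments.

    Each argument is a (verbs, nouns) pair extracted from one query.
    Returns a value in [0, 1]: the mean Jaccard overlap of verbs and nouns.
    """
    (verbs_a, nouns_a), (verbs_b, nouns_b) = query_a, query_b

    def jaccard(x: Set[str], y: Set[str]) -> float:
        return len(x & y) / len(x | y) if (x | y) else 0.0

    return 0.5 * (jaccard(verbs_a, verbs_b) + jaccard(nouns_a, nouns_b))


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) moments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def pick_plausible(predictions: List[Tuple[float, float]],
                   interest_moments: List[Tuple[float, float]],
                   iou_threshold: float = 0.5) -> List[int]:
    """Indices of predictions that overlap some interest moment well enough.

    Only these predictions would contribute to the regression loss, leaving
    the remaining prediction heads free to cover other plausible moments.
    """
    keep = []
    for i, pred in enumerate(predictions):
        if any(temporal_iou(pred, m) >= iou_threshold for m in interest_moments):
            keep.append(i)
    return keep


if __name__ == "__main__":
    # Two queries on the same video sharing the verb "open" and noun "door".
    q1 = ({"open"}, {"person", "door"})
    q2 = ({"open", "walk"}, {"door"})
    print(soft_label(q1, q2))  # 0.5: treated as a soft positive pair

    preds = [(2.0, 7.5), (10.0, 15.0), (30.0, 34.0)]
    interest = [(1.5, 8.0), (29.0, 35.0)]
    print(pick_plausible(preds, interest))  # [0, 2]
```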
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3179314