Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 11204-11218
Main authors: Xiong, Zeyu; Liu, Daizong; Fang, Xiang; Qu, Xiaoye; Dong, Jianfeng; Zhu, Jiahao; Tang, Keke; Zhou, Pan
Format: Article
Language: English
Description
Summary: Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. Many existing methods extract frame-grained features using pre-trained 2D or 3D convolutional networks, but they often fail to capture subtle differences between ambiguous adjacent frames. Although some recent approaches incorporate object-grained features using Faster R-CNN to capture more fine-grained details, they are still primarily based on feature enhancement and lack the spatio-temporal modeling needed to explore the semantics of the core persons and objects. To model the behavior of the core target, this paper proposes a new perspective on the VSG task: tracking pivotal objects and activities to learn more fine-grained spatio-temporal features. Specifically, we introduce the Video Sentence Tracker with Memory Network and Masked Attention (VSTMM), which comprises three components: a cross-modal targets generator that produces multi-modal templates and a search space; a memory-based tracker that dynamically tracks multi-modal targets, using a memory network to record the targets' behaviors; and a masked attention localizer that learns local shared features between frames and eliminates interference from long-term dependencies, which improves accuracy when localizing the moment. To evaluate VSTMM, we conducted extensive experiments and comparisons with state-of-the-art methods on three challenging benchmarks: Charades-STA, ActivityNet Captions, and TACoS. Without bells and whistles, VSTMM achieves leading performance with considerable real-time speed.
ISSN: 1520-9210, 1941-0077
DOI:10.1109/TMM.2024.3453062
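
Illustrative note: the summary says the masked attention localizer learns local shared features between frames while suppressing long-term dependencies. The record does not say how the mask is built, so the following is only a minimal sketch that assumes a banded (sliding-window) mask applied to plain scaled dot-product self-attention. The names `local_band_mask`, `masked_self_attention`, and the `window` parameter are hypothetical and are not taken from the paper.

```python
import math


def local_band_mask(num_frames, window):
    """Return mask[i][j] == True when frame j lies within `window` frames of frame i.

    Assumption: "local" is modeled as a fixed temporal window; the paper may
    define its mask differently.
    """
    return [[abs(i - j) <= window for j in range(num_frames)]
            for i in range(num_frames)]


def masked_self_attention(features, mask):
    """Scaled dot-product self-attention over per-frame feature vectors.

    Frame pairs that the mask disallows receive a score of -inf, so their
    softmax weight is exactly zero and distant frames cannot influence the output.
    """
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, key in enumerate(features)
        ]
        peak = max(scores)  # finite, because every frame may attend to itself
        weights = [math.exp(s - peak) for s in scores]  # exp(-inf) is 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * frame[d] for w, frame in zip(weights, features))
            for d in range(dim)
        ])
    return outputs


# Toy usage: four 2-dimensional frame features, each attending to its direct neighbours.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
attended = masked_self_attention(frames, local_band_mask(len(frames), window=1))
```

With `window=1`, the first frame's output depends only on frames 0 and 1, so changing the last frame leaves it untouched. That is the sense in which a mask of this kind removes long-range interference.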