Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention

Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. While many existing methods extract frame-grained features using pre-trained 2D or 3D convolution networks, they often fail to capture subtle differ...

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 11204-11218
Main authors: Xiong, Zeyu; Liu, Daizong; Fang, Xiang; Qu, Xiaoye; Dong, Jianfeng; Zhu, Jiahao; Tang, Keke; Zhou, Pan
Format: Article
Language: English