Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention

Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed video that semantically corresponds to a given natural language query. While many existing methods extract frame-grained features using pre-trained 2D or 3D convolution networks, they often fail to capture subtle differ...

Bibliographic Details
Published in:	IEEE transactions on multimedia 2024, Vol.26, p.11204-11218
Main authors:	Xiong, Zeyu; Liu, Daizong; Fang, Xiang; Qu, Xiaoye; Dong, Jianfeng; Zhu, Jiahao; Tang, Keke; Zhou, Pan
Format:	Article
Language:	English
Subjects:	Cross-modal; Feature extraction; Grounding; masked attention; memory network; Object tracking; Semantics; Target tracking; Task analysis; tracking; Visualization; VSG