Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems



Bibliographic Details
Published in: International Journal on Semantic Web and Information Systems, 2023-01, Vol. 19 (1), p. 1-20
Main authors: Xu, Yifang, Sun, Yunzhuo, Xie, Zien, Zhai, Benxiang, Jia, Youyao, Du, Sidan
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: With the surge in online video content, finding highlights and key video segments has garnered widespread attention. Given a textual query, video highlight detection (HD) and temporal grounding (TG) aim to predict frame-wise saliency scores from a video while concurrently locating all relevant spans. Despite recent progress in DETR-based works, these methods crudely fuse the different inputs in the encoder, which limits effective cross-modal interaction. To address this challenge, the authors design QD-Net (query-guided refinement and dynamic spans network), tailored for HD&TG. Specifically, they propose a query-guided refinement module to decouple feature encoding from the interaction process. Furthermore, they present a dynamic span decoder that leverages learnable 2D spans as decoder queries, which accelerates training convergence for TG. On the QVHighlights dataset, the proposed QD-Net achieves 61.87 HD-HIT@1 and 61.88 TG-mAP@0.5, yielding significant improvements of +1.88 and +8.05, respectively, over the state-of-the-art method.
ISSN: 1552-6283, 1552-6291
DOI: 10.4018/IJSWIS.332768
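
The abstract above describes a dynamic span decoder that uses learnable 2D (center, width) spans as decoder queries in a DETR-style temporal grounding head. The following minimal PyTorch sketch illustrates that general idea only; it is not the authors' QD-Net code, and the module names, feature dimensions, sinusoidal span embedding, and per-layer refinement step are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch (not the authors' implementation) of learnable 2D spans
# used as decoder queries: each query is a (center, width) pair normalized to
# [0, 1] over the clip axis and refined layer by layer.
import math
import torch
import torch.nn as nn


def span_embed(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of normalized scalars in [0, 1]; `dim` must be even."""
    freqs = torch.arange(dim // 2, dtype=x.dtype, device=x.device)
    angles = x.unsqueeze(-1) * 2 * math.pi / (10000 ** (2 * freqs / dim))
    return torch.cat([angles.sin(), angles.cos()], dim=-1)


class DynamicSpanDecoder(nn.Module):
    """Decoder whose queries are learnable (center, width) spans refined per layer."""

    def __init__(self, d_model: int = 256, num_queries: int = 10, num_layers: int = 2):
        super().__init__()
        self.d_model = d_model
        # Learnable 2D spans in logit space; sigmoid keeps (center, width) in [0, 1].
        self.span_logits = nn.Parameter(torch.randn(num_queries, 2))
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
             for _ in range(num_layers)]
        )
        self.query_proj = nn.Linear(d_model, d_model)  # span embedding -> query content
        self.delta_head = nn.Linear(d_model, 2)        # per-layer span refinement

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        """memory: (batch, num_clips, d_model) query-aware video features."""
        bsz = memory.size(0)
        spans = self.span_logits.sigmoid().unsqueeze(0).expand(bsz, -1, -1)  # (B, Q, 2)
        for layer in self.layers:
            # Re-embed the current spans so positional information tracks each refinement.
            tgt = self.query_proj(span_embed(spans, self.d_model // 2).flatten(-2))
            hidden = layer(tgt, memory)
            # Small residual update to (center, width), clamped back to [0, 1].
            spans = (spans + 0.1 * self.delta_head(hidden).tanh()).clamp(0.0, 1.0)
        return spans  # (B, Q, 2): normalized (center, width) per predicted moment


# Illustrative usage: a batch of 2 videos, each with 32 clip features of size 256.
memory = torch.randn(2, 32, 256)
pred_spans = DynamicSpanDecoder()(memory)  # -> shape (2, 10, 2)
```

Refining explicit (center, width) pairs instead of abstract query embeddings is what gives such a decoder a concrete geometric prior at every layer, which is the kind of mechanism the abstract credits with faster training convergence for temporal grounding.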