Momentum Cross-Modal Contrastive Learning for Video Moment Retrieval
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-07, Vol. 34 (7), p. 5977-5994
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Video moment retrieval aims to locate the timestamps within an untrimmed video that best match a query description. However, existing video moment retrieval approaches typically suffer from two major limitations: (1) they use only negative moment-sentence pairs sampled within individual videos, which may overfit dataset biases and, given limited dataset size and annotation biases, fail to build a strong understanding of the video and the query; (2) they decouple the video and the query, perform unimodal learning separately, and then concatenate the results as multimodal fusion features. In this paper, we propose a novel approach named Momentum Contrastive Matching Network (MCMN). Inspired by MoCo, we propose a Momentum Cross-modal Contrast for cross-modal learning that enables large-scale negative sample interactions, which helps generate more precise and discriminative representations, and we use temporal decay to model key attenuation in the memory queue when computing the contrastive loss. In addition, we use an attention module to adaptively generate clip-specific word embeddings that achieve semantic alignment from a temporal perspective and are considered more important for finding relevant video content with large boundary ambiguities. Experimental results on three major video moment retrieval benchmarks, TACoS, Charades-STA, and ActivityNet Captions, demonstrate that MCMN surpasses previous methods and reaches state-of-the-art performance with disparate visual features. (A rough implementation sketch of the momentum cross-modal contrast follows the record below.)
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3344097
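
The abstract is only conceptual, so the following is a minimal PyTorch sketch, not the authors' implementation, of the mechanism it describes: a MoCo-style cross-modal InfoNCE loss with a momentum-updated sentence key encoder, a memory queue of negative keys, and a temporal-decay weight that attenuates older queue entries. The class name, the linear stand-in encoders, and all hyperparameters (`queue_size`, `momentum`, `temperature`, `decay`) are illustrative assumptions; the paper's actual encoders and exact decay formulation may differ.

```python
# Hypothetical sketch of momentum cross-modal contrast with a decaying memory
# queue; names and hyperparameters are assumptions, not the paper's code.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MomentumCrossModalContrast(nn.Module):
    def __init__(self, dim=256, queue_size=4096, momentum=0.999,
                 temperature=0.07, decay=0.99):
        super().__init__()
        self.m, self.t, self.decay = momentum, temperature, decay

        # Stand-in projection heads; a real model would use full video/text encoders.
        self.moment_enc = nn.Linear(dim, dim)     # moment (query) encoder, gradient-updated
        self.sent_enc = nn.Linear(dim, dim)       # sentence encoder, trained by the rest of the model
        self.sent_key_enc = nn.Linear(dim, dim)   # momentum copy that produces queue keys
        self.sent_key_enc.load_state_dict(self.sent_enc.state_dict())
        for p in self.sent_key_enc.parameters():
            p.requires_grad = False

        # Memory queue of sentence keys plus the age (in steps) of each slot.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("age", torch.zeros(queue_size))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # k <- m * k + (1 - m) * q, as in MoCo.
        for p_k, p_q in zip(self.sent_key_enc.parameters(), self.sent_enc.parameters()):
            p_k.data.mul_(self.m).add_(p_q.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        ptr, n = int(self.ptr), keys.size(0)
        idx = torch.arange(ptr, ptr + n, device=keys.device) % self.queue.size(0)
        self.age += 1          # every stored key gets one step older
        self.queue[idx] = keys
        self.age[idx] = 0      # freshly enqueued keys start at age zero
        self.ptr[0] = (ptr + n) % self.queue.size(0)

    def forward(self, moment_feat, sent_feat):
        # moment_feat, sent_feat: (B, dim) pooled features of matched pairs.
        q = F.normalize(self.moment_enc(moment_feat), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.sent_key_enc(sent_feat), dim=1)

        l_pos = (q * k).sum(dim=1, keepdim=True) / self.t    # (B, 1) matched pairs
        # Adding log(decay ** age) to a negative logit multiplies that key's
        # exponential term in the softmax by its decay weight, so stale queue
        # entries are attenuated in the contrastive loss.
        log_w = self.age * math.log(self.decay)
        l_neg = q @ self.queue.t() / self.t + log_w           # (B, K)

        logits = torch.cat([l_pos, l_neg], dim=1)
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)                # positive sits at index 0

        self._enqueue(k)
        return loss


# Toy usage: a batch of 8 matched moment-sentence feature pairs.
loss_fn = MomentumCrossModalContrast()
print(loss_fn(torch.randn(8, 256), torch.randn(8, 256)))
```

Folding the decay in as an additive log-weight keeps the objective a standard cross-entropy while scaling each stale key's contribution to the softmax denominator; the clip-specific word-embedding attention mentioned in the abstract is a separate module and is not shown here.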