Multi-video Moment Ranking with Multimodal Clue
Main authors:
Format: Article
Language: English
Keywords:
Online access: Order full text
Abstract: Video corpus moment retrieval (VCMR) is the task of retrieving a relevant video moment from a large corpus of untrimmed videos via a natural language query. State-of-the-art work on VCMR is based on a two-stage method. In this paper, we focus on fixing two problems of the two-stage method: (1) Moment prediction bias: the predicted moments for most queries come from the top-ranked retrieved videos, ignoring the possibility that the target moment lies in a lower-ranked retrieved video; this bias is caused by the inconsistency of Shared Normalization between training and inference. (2) Latent key content: different modalities of a video carry different key information for moment localization. To this end, we propose a two-stage model, MultI-video raNking with mUlTimodal cluE (MINUTE). MINUTE uses Shared Normalization during both training and inference to rank candidate moments from multiple videos, solving the moment prediction bias and making it more efficient to predict the target moment. In addition, the Multimodal Clue Mining (MCM) module of MINUTE discovers the key content of different modalities in a video to localize moments more accurately. MINUTE outperforms the baselines on the TVR and DiDeMo datasets, achieving a new state of the art for VCMR. Our code will be available on GitHub.
DOI: 10.48550/arxiv.2301.13606
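As a rough illustration of the Shared Normalization idea described in the abstract, candidate-moment scores from all retrieved videos can be pooled and normalized with a single softmax, so that moments from lower-ranked videos compete directly with moments from the top-ranked ones. This is a minimal sketch under that assumption; the function name, tensor shapes, and toy scores are illustrative and not taken from the MINUTE codebase.

```python
# Hypothetical sketch of Shared Normalization over candidate moments
# pooled from several retrieved videos (not the authors' implementation).
import torch
import torch.nn.functional as F

def shared_normalization(moment_scores_per_video):
    """Normalize moment scores jointly over all retrieved videos.

    moment_scores_per_video: list of 1-D tensors, one per retrieved video,
    each holding the raw scores of that video's candidate moments.
    Returns a list of tensors with the same shapes whose values sum to 1
    across *all* videos rather than within each video separately.
    """
    lengths = [s.numel() for s in moment_scores_per_video]
    pooled = torch.cat(moment_scores_per_video)   # all candidates in one pool
    probs = F.softmax(pooled, dim=0)              # one softmax over the pool
    return list(torch.split(probs, lengths))

# Toy usage: two retrieved videos with 3 and 2 candidate moments.
scores_v1 = torch.tensor([2.0, 0.5, -1.0])
scores_v2 = torch.tensor([1.5, 3.0])
probs = shared_normalization([scores_v1, scores_v2])
# The best moment may come from the second (lower-ranked) video,
# which a per-video softmax would hide from the final ranking.
best_video = max(range(len(probs)), key=lambda i: probs[i].max())
```

Applying the same pooled normalization at both training and inference, as the abstract describes, avoids the train/inference mismatch that produces the moment prediction bias.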