Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | Video moment retrieval (VMR) searches for a visual temporal moment in an
untrimmed raw video given a text query description (a sentence). Existing
studies either rely on exhaustive frame-wise annotations of the temporal
boundaries of target moments (fully-supervised), or learn from only the
video-level video-text pairing labels (weakly-supervised). The former
generalises poorly to unknown concepts and/or novel scenes, because expensive
annotation restricts dataset scale and diversity; the latter is subject to
visual-textual mis-correlations from incomplete labels. In this work, we
introduce a new approach called hybrid-learning video moment retrieval, which
addresses the problem through knowledge transfer: it adapts the video-text
matching relationships learned from a fully-supervised source domain to a
weakly-labelled target domain when the two do not share a common label space.
Our aim is to explore shared universal knowledge between the two domains in
order to improve model learning in the weakly-labelled target domain.
Specifically, we introduce a multiplE branch Video-text Alignment model (EVA)
that performs cross-modal (visual-textual) matching information sharing and
multi-modal feature alignment to optimise domain-invariant visual and textual
features as well as per-task discriminative joint video-text representations.
Experiments show EVA's effectiveness in exploring temporal segment annotations
in a source domain to help learn video moment retrieval without temporal labels
in a target domain. |
DOI: | 10.48550/arxiv.2406.01791 |