Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning
Main Authors: , , , , ,
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Abstract: Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level, using these concepts to enhance video features and provide temporal event cues; and (2) design cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator's event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
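The concept branch described in the abstract (weakly supervised frame-level concept detection, concept-enhanced video features, and video-level concept contrastive learning) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names, the max-pooled MIL supervision, the soft concept-cue fusion, the InfoNCE form of the contrastive loss, and all dimensions are assumptions.

```python
# Minimal sketch of the concept branch: weakly supervised frame-level
# concept detection, concept-enhanced video features, and a video-level
# concept contrastive loss. Names, pooling choices, and loss forms are
# assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptDetector(nn.Module):
    """Frame-level concept detection trained with only video-level labels."""

    def __init__(self, feat_dim=512, num_concepts=100, emb_dim=512):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_concepts)      # per-frame concept logits
        self.concept_emb = nn.Embedding(num_concepts, emb_dim)   # learnable concept embeddings
        self.fuse = nn.Linear(feat_dim + emb_dim, feat_dim)      # concept-cue integration

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        logits = self.classifier(frames)            # (B, T, C) frame-level scores
        probs = logits.sigmoid()
        # Soft-attend over concept embeddings to build a per-frame cue vector.
        cues = probs @ self.concept_emb.weight      # (B, T, emb_dim)
        # Integrate concept cues into the video features (temporal event cues).
        enhanced = self.fuse(torch.cat([frames, cues], dim=-1))
        # Max-pool frame scores to a video-level prediction for weak supervision.
        video_logits = logits.max(dim=1).values     # (B, C)
        return enhanced, probs, video_logits

def weak_detection_loss(video_logits, video_labels):
    """BCE against multi-hot video-level concept labels (weak supervision)."""
    return F.binary_cross_entropy_with_logits(video_logits, video_labels)

def concept_contrastive_loss(video_feat, concept_emb, video_labels, tau=0.1):
    """One plausible form of video-level concept contrastive learning: pull a
    video's pooled feature toward embeddings of concepts present in the video
    and push it away from absent ones (InfoNCE over concepts)."""
    v = F.normalize(video_feat, dim=-1)             # (B, D)
    c = F.normalize(concept_emb, dim=-1)            # (C, D)
    sim = v @ c.t() / tau                           # (B, C)
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    pos = (log_prob * video_labels).sum(-1) / video_labels.sum(-1).clamp(min=1)
    return -pos.mean()

if __name__ == "__main__":
    det = ConceptDetector()
    frames = torch.randn(2, 64, 512)                # 2 videos, 64 frames each
    labels = torch.randint(0, 2, (2, 100)).float()  # video-level concept labels
    enhanced, probs, vlogits = det(frames)
    loss = weak_detection_loss(vlogits, labels) + \
           concept_contrastive_loss(enhanced.mean(1), det.concept_emb.weight, labels)
    print(enhanced.shape, loss.item())
```

Because both losses are driven only by video-level labels, the frame-level scores that emerge are the "temporal event cues" the abstract refers to; the contrastive term additionally keeps the concept embeddings discriminative.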
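The cyclic co-learning couples the caption generator and the event localizer through two matching signals: the generator guides localization via semantic matching, and the localizer feeds locations back via location matching. Below is a hedged sketch of what such losses could look like; the InfoNCE form, the L1 form over (center, width) segments, the function names, and the alternating schedule in the comments are all assumptions, not the paper's exact design.

```python
# Hypothetical sketch of the two matching signals in cyclic co-learning.
import torch
import torch.nn.functional as F

def semantic_matching_loss(caption_emb, segment_feat, tau=0.07):
    """Generator -> localizer: align each localized segment's features with
    the semantics of the sentence generated for it, so caption semantics
    supervise where events are (symmetric InfoNCE over matched pairs)."""
    s = F.normalize(caption_emb, dim=-1)    # (N, D) sentence embeddings
    g = F.normalize(segment_feat, dim=-1)   # (N, D) pooled segment features
    sim = s @ g.t() / tau                   # (N, N); matched pairs on the diagonal
    target = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(sim, target) + F.cross_entropy(sim.t(), target))

def location_matching_loss(pred_segments, matched_segments):
    """Localizer -> generator: keep the temporal extent the generator attends
    to consistent with the localizer's segments; a simple L1 on normalized
    (center, width) pairs stands in for the paper's location matching."""
    return F.l1_loss(pred_segments, matched_segments)

# One co-learning cycle could then alternate (the schedule is an assumption):
#   1) freeze the localizer; train the generator with the captioning loss
#      + location_matching_loss, improving event semantic perception;
#   2) freeze the generator; train the localizer with the localization loss
#      + semantic_matching_loss, improving event localization.

if __name__ == "__main__":
    cap = torch.randn(8, 512)   # sentence embeddings from the generator
    seg = torch.randn(8, 512)   # features of the segments matched to them
    pred, ref = torch.rand(8, 2), torch.rand(8, 2)  # (center, width) in [0, 1]
    print(semantic_matching_loss(cap, seg).item(),
          location_matching_loss(pred, ref).item())
```

The point of the cycle is the mutual benefit the abstract claims: each module's output becomes a training signal for the other, so semantic perception and event localization improve together rather than being optimized in isolation.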
DOI: 10.48550/arxiv.2412.11467