CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion
Format: Article
Language: English
Abstract: Video saliency prediction aims to identify the regions in a video that
attract human attention and gaze, driven by bottom-up features from the video
and top-down processes like memory and cognition. Among these top-down
influences, language plays a crucial role in guiding attention by shaping how
visual information is interpreted. Existing methods primarily focus on modeling
perceptual information while neglecting the reasoning process facilitated by
language, where ranking cues are crucial outcomes of this process and practical
guidance for saliency prediction. In this paper, we propose CaRDiff (Caption,
Rank, and generate with Diffusion), a framework that imitates the process by
integrating a multimodal large language model (MLLM), a grounding module, and a
diffusion model, to enhance video saliency prediction. Specifically, we
introduce a novel prompting method VSOR-CoT (Video Salient Object Ranking Chain
of Thought), which utilizes an MLLM with a grounding module to caption video
content and infer salient objects along with their rankings and positions. This
process derives ranking maps that the diffusion model can leverage to
accurately decode the saliency maps for the given video.
Extensive experiments show the effectiveness of VSOR-CoT in improving the
performance of video saliency prediction. The proposed CaRDiff performs better
than state-of-the-art models on the MVS dataset and demonstrates cross-dataset
capabilities on the DHF1k dataset through zero-shot evaluation.
DOI: 10.48550/arxiv.2408.12009
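
The abstract describes a three-stage pipeline: an MLLM with a grounding module produces a caption plus ranked, localized salient objects (VSOR-CoT), those objects are turned into a ranking map, and a diffusion model conditions on that map to decode the saliency map. The sketch below illustrates only the middle step in runnable form; every name in it (RankedObject, ranking_map, the rank-to-intensity encoding) is a hypothetical placeholder for illustration, not the authors' released code or their exact map format.

```python
"""Minimal sketch of turning VSOR-CoT output into a ranking map.

Assumption: the MLLM + grounding module yields salient objects with a
saliency rank and a normalized bounding box; the diffusion decoder then
conditions on the rasterized map. All names here are hypothetical.
"""
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np


@dataclass
class RankedObject:
    name: str
    rank: int                                   # 1 = most salient
    box: Tuple[float, float, float, float]      # normalized (x1, y1, x2, y2)


def ranking_map(objects: List[RankedObject], size: int = 64) -> np.ndarray:
    """Rasterize ranked boxes into a single-channel ranking map.

    More salient objects receive larger values; overlapping regions keep
    the maximum. The exact encoding used by CaRDiff is not specified in
    the abstract, so this linear rank-to-intensity mapping is an assumption.
    """
    m = np.zeros((size, size), dtype=np.float32)
    if not objects:
        return m
    max_rank = max(o.rank for o in objects)
    for o in objects:
        x1, y1, x2, y2 = (int(round(v * size)) for v in o.box)
        value = (max_rank - o.rank + 1) / max_rank     # rank 1 -> 1.0
        m[y1:y2, x1:x2] = np.maximum(m[y1:y2, x1:x2], value)
    return m


if __name__ == "__main__":
    # Example objects as a VSOR-CoT response might rank them for one frame.
    objs = [
        RankedObject("speaker", rank=1, box=(0.35, 0.20, 0.65, 0.80)),
        RankedObject("laptop", rank=2, box=(0.05, 0.55, 0.30, 0.90)),
    ]
    rmap = ranking_map(objs)
    print(rmap.shape, rmap.max())   # (64, 64) 1.0
    # A diffusion decoder would then take (video features, rmap) and
    # iteratively denoise toward the final saliency map.
```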