Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering
Main authors: , , , , ,
Format: Article
Language: English
Online access: Order full text
Summary: WACV 2025. This paper tackles the challenging problem of video
question answering (VideoQA). Despite notable progress, current methods fall
short of effectively integrating questions with video frames and semantic
object-level abstractions to create question-aware video representations. We
introduce Local-Global Question Aware Video Embedding (LGQAVE), which
incorporates three major innovations to better integrate multi-modal knowledge
and emphasize the semantic visual concepts relevant to a specific question.
LGQAVE moves beyond traditional ad-hoc frame sampling by using a
cross-attention mechanism that precisely identifies the frames most relevant
to the question. It captures the dynamics of objects within these frames using
distinct graphs, grounding them in question semantics with the miniGPT model.
These graphs are processed by a question-aware dynamic graph transformer
(Q-DGT), which refines the outputs into nuanced global and local video
representations. An additional cross-attention module integrates these local
and global embeddings to produce the final video embedding, which a language
model uses to generate answers. Extensive evaluations across multiple
benchmarks demonstrate that LGQAVE significantly outperforms existing models
in delivering accurate multiple-choice and open-ended answers.
DOI: 10.48550/arxiv.2412.09230
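
The summary above describes two cross-attention steps: scoring frames against
the question to pick the most relevant ones, and fusing local and global video
embeddings into a final representation. The following is a minimal sketch of
those two steps, not the authors' implementation: the module names
(`QuestionFrameSelector`, `LocalGlobalFusion`), dimensions, the learned
scoring head, and the top-k selection rule are all illustrative assumptions.

```python
# Illustrative sketch (NOT the paper's code) of question-aware frame
# selection and local-global embedding fusion via cross-attention.
# All names, dimensions, and the top-k rule are assumptions.
import torch
import torch.nn as nn


class QuestionFrameSelector(nn.Module):
    """Cross-attends frames over question tokens, scores each frame with
    a small learned head, and keeps the top-k frames (a hypothetical
    stand-in for LGQAVE's question-aware frame sampling)."""

    def __init__(self, dim: int = 256, num_heads: int = 4, top_k: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-frame relevance score
        self.top_k = top_k

    def forward(self, frames: torch.Tensor, question: torch.Tensor):
        # frames:   (B, T, D) per-frame features
        # question: (B, L, D) question token features
        attended, _ = self.attn(frames, question, question)  # frames query question
        scores = self.score(attended).squeeze(-1)            # (B, T)
        idx = scores.topk(self.top_k, dim=-1).indices        # (B, k)
        idx_exp = idx.unsqueeze(-1).expand(-1, -1, frames.size(-1))
        selected = torch.gather(attended, 1, idx_exp)        # (B, k, D)
        return selected, idx


class LocalGlobalFusion(nn.Module):
    """Lets a global clip-level embedding cross-attend over local
    (frame/object-level) embeddings and returns one fused video vector."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, local: torch.Tensor, global_: torch.Tensor):
        # local:   (B, N, D) e.g. question-aware graph/frame outputs
        # global_: (B, 1, D) pooled clip-level embedding
        fused, _ = self.attn(global_, local, local)  # global queries local
        return fused.squeeze(1)                      # (B, D)


if __name__ == "__main__":
    B, T, L, D = 2, 32, 12, 256
    frames, question = torch.randn(B, T, D), torch.randn(B, L, D)
    selected, idx = QuestionFrameSelector(D)(frames, question)
    video_vec = LocalGlobalFusion(D)(selected, frames.mean(1, keepdim=True))
    print(selected.shape, video_vec.shape)  # (2, 8, 256) (2, 256)
```

Hard top-k selection as sketched here is one plausible reading of
"precisely identifies the most relevant frames"; a soft weighting of all
frames would be an equally valid interpretation of the abstract.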