Complementary spatiotemporal network for video question answering
Published in: Multimedia Systems, 2022-02, Vol. 28 (1), p. 161-169
Main authors: , ,
Format: Article
Language: English
Online access: Full text
Abstract: Video question answering (VideoQA) is challenging because it requires models to capture motion and spatial semantics and to associate them with linguistic context. Recent methods usually treat space and time symmetrically. Since spatial structures and temporal events often change at different speeds in a video, such methods have difficulty distinguishing spatial details and motion relationships at different scales. To this end, we propose a complementary spatiotemporal network (CST) that focuses on multi-scale motion relationships and essential spatial semantics. Our model comprises three modules. First, a multi-scale relation unit (MR) captures temporal information by modeling motions at different temporal distances. Second, a mask similarity (MS) operation captures discriminative spatial semantics in a less redundant manner. Third, cross-modality attention (CMA) strengthens the interaction between the modalities. We evaluate our method on three benchmark datasets and conduct extensive ablation studies. The performance improvements demonstrate the effectiveness of our approach.
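The record carries no code, but the abstract's description of the multi-scale relation unit is concrete enough for a rough illustration. The following PyTorch sketch shows one plausible reading of "modeling different distances between motions": relating clip features that lie several temporal strides apart. It is not the authors' implementation; the class name, the strides (1, 2, 4), the pairwise MLP, and the average fusion are all assumptions.

import torch
import torch.nn as nn

class MultiScaleRelation(nn.Module):
    """Hypothetical multi-scale relation unit: relates clip features that lie
    1, 2, or 4 steps apart, pools each scale, then averages across scales."""
    def __init__(self, dim=512, strides=(1, 2, 4)):
        super().__init__()
        self.strides = strides
        # one small relation MLP per temporal scale
        self.relate = nn.ModuleList(
            [nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU()) for _ in strides]
        )

    def forward(self, clips):
        # clips: (batch, num_clips, dim) video clip features
        scale_outs = []
        for stride, mlp in zip(self.strides, self.relate):
            # pair each clip with the clip `stride` steps later
            pairs = torch.cat([clips[:, :-stride], clips[:, stride:]], dim=-1)
            scale_outs.append(mlp(pairs).mean(dim=1))  # pool this scale's relations
        return torch.stack(scale_outs, dim=1).mean(dim=1)  # fuse the scales

# usage: 8 clip features of width 512 -> one multi-scale motion descriptor
mr = MultiScaleRelation(dim=512)
out = mr(torch.randn(2, 8, 512))
print(out.shape)  # torch.Size([2, 512])

Relating pairs at several strides is one simple way to expose motions that unfold over different time spans; in the full model, a cross-modality attention step would then let such motion descriptors and the spatial features interact with the encoded question.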
ISSN: 0942-4962; 1432-1882
DOI: 10.1007/s00530-021-00805-6