HSSHG: Heuristic Semantics-Constrained Spatio-Temporal Heterogeneous Graph for VideoQA

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 11176-11190
Main authors: Wang, Ruomei; Luo, Yuanmao; Zhang, Fuwei; Liu, Mingyang; Luo, Xiaonan
Format: Article
Language: English
Description
Summary: Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling object-level spatio-temporal reasoning via graph neural networks. However, existing graph-network-based models still have deficiencies when constructing the spatio-temporal relationships between objects: (1) spatio-temporal constraints between objects are not considered when defining the adjacency relationship; (2) the semantic correlation between objects is not fully considered when generating edge weights. As a result, these models under-represent the spatio-temporal interaction between objects, which directly impairs their ability to reason about object relations. To solve these problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. Plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneous graph recovers the spatio-temporal relationship between objects more accurately and strengthens the model's object-level spatio-temporal reasoning ability. Based on this graph, the paper proposes the Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on the MSVD-QA and FrameQA benchmark datasets and demonstrates competitive results on the MSRVTT-QA and ActivityNet-QA benchmark datasets. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of the hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3443661