Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2022, Vol. 24, p. 2986-2997
Main authors: Zhang, Xi; Zhang, Feifei; Xu, Changsheng
Format: Article
Language: English
Description
Summary: Given a question about an image, Visual Commonsense Reasoning (VCR) requires a model to provide not only a correct answer but also a rationale that justifies it. VCR is challenging because it requires proper semantic alignment and reasoning between the image and the linguistic expression. Recent approaches show great promise by exploring holistic attention mechanisms or graph-based networks, but most of them reason implicitly and ignore the semantic dependencies within the linguistic expression. In this paper, we propose a novel explicit cross-modal representation learning network for VCR that incorporates syntactic information into visual reasoning and natural language understanding. The proposed method has several merits. First, based on a two-branch neural module network, it performs explicit cross-modal reasoning guided by the high-level syntactic structure of the linguistic expression. Second, the semantic structure of the linguistic expression is incorporated into a syntactic GCN to facilitate language understanding. Third, the network provides a traceable reasoning flow, which offers visible fine-grained evidence for the answer and rationale. Quantitative and qualitative evaluations on the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2021.3091882