Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning
Published in: | IEEE Transactions on Multimedia 2022, Vol. 24, pp. 2986-2997 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Given a question about an image, Visual Commonsense Reasoning (VCR) requires not only a correct answer but also a rationale that justifies it. VCR is challenging because it demands proper semantic alignment and reasoning between the image and the linguistic expression. Recent approaches show great promise by exploring holistic attention mechanisms or graph-based networks, but most of them reason implicitly and ignore the semantic dependencies within the linguistic expression. In this paper, we propose a novel explicit cross-modal representation learning network for VCR that incorporates syntactic information into visual reasoning and natural language understanding. The proposed method has several merits. First, based on a two-branch neural module network, it performs explicit cross-modal reasoning guided by the high-level syntactic structure of the linguistic expression. Second, the semantic structure of the linguistic expression is incorporated into a syntactic GCN to facilitate language understanding. Third, the network provides a traceable reasoning flow, which offers visible fine-grained evidence for the answer and rationale. Quantitative and qualitative evaluations on the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods. |
---|---|
ISSN: | 1520-9210, 1941-0077 |
DOI: | 10.1109/TMM.2021.3091882 |