LRB-Net: Improving VQA via division of labor strategy and multimodal classifiers



Bibliographic Details
Published in: Displays 2022-12, Vol. 75, p. 102329, Article 102329
Main authors: Feng, Jiangfan; Liu, Ruiguo
Format: Article
Language: English
Description
Abstract: Visual question answering (VQA) spans many types of images and textual questions, which makes inferring the correct answer challenging. Traditional methods rely on relevant cross-modal objects and seldom exploit the cooperation between visual appearance and textual understanding. Here, we present LRBNet, a model-based approach that treats VQA as a division-of-labor strategy. Using a dictionary and pre-trained GloVe vectors, we embed region captions and questions with an LSTM. We then model image region captions and region features as graphs and feed them into two GNN-based networks to capture semantic and visual relations, and we modulate the vertex features of the resulting multimodal graphs. Finally, the question embedding and vertex features are fed into a multi-level answer predictor to produce the results. We experimentally validate that LRBNet is an effective framework for visual–textual understanding and a more demanding route to better VQA, since it must understand the image to succeed. Our study provides complementary prediction through hierarchical representation within and beyond the interactive understanding of the textual sequence and the images, and the experimental results show that LRBNet outperforms other leading models in most cases.
• We propose a novel VQA model that simulates the physiological structure and function of the human brain.
• We provide an interactive module that imitates the information-exchange process of the corpus callosum.
• We further present an ensemble-learning-based answering strategy for the proposed model.
• Our model achieves state-of-the-art performance compared to other leading VQA models.
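The abstract describes the pipeline only at a high level (GloVe + LSTM text embedding, two graph branches over region features and region captions, question-conditioned modulation of vertex features, and an answer predictor). Below is a minimal, hypothetical PyTorch sketch of such a division-of-labor architecture; all module names, dimensions, the GCN-style graph layer, and the gating scheme are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an LRBNet-style pipeline; not the authors' code.
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One GCN-style message-passing step: aggregate neighbors via a
    row-normalized adjacency matrix, then apply a linear transform."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, nodes, dim); adj: (batch, nodes, nodes)
        return torch.relu(self.linear(torch.bmm(adj, x)))

class LRBNetSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 region_dim=2048, num_answers=3000):
        super().__init__()
        # Text embedding: GloVe-sized lookup table followed by an LSTM
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Two graph branches: visual (region features) and semantic (captions)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.caption_proj = nn.Linear(embed_dim, hidden_dim)
        self.visual_gnn = SimpleGNNLayer(hidden_dim)
        self.semantic_gnn = SimpleGNNLayer(hidden_dim)
        # Question-conditioned modulation of graph vertex features
        self.modulate = nn.Linear(hidden_dim, hidden_dim)
        # Answer predictor over the fused graph/question representation
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers))

    def forward(self, question, regions, region_adj, captions, caption_adj):
        # Question embedding: final LSTM hidden state
        _, (q, _) = self.lstm(self.word_embed(question))
        q = q[-1]                                        # (batch, hidden_dim)
        # Semantic nodes: mean-pooled caption word embeddings, then GNN
        cap = self.caption_proj(self.word_embed(captions).mean(dim=2))
        sem = self.semantic_gnn(cap, caption_adj)
        # Visual nodes: projected region features, then GNN
        vis = self.visual_gnn(self.region_proj(regions), region_adj)
        # Modulate vertex features with the question, pool, and classify
        gate = torch.sigmoid(self.modulate(q)).unsqueeze(1)
        fused = torch.cat([(gate * vis).mean(1), (gate * sem).mean(1)], dim=-1)
        return self.classifier(fused)
```

The sketch keeps the two branches separate until a late fusion step, mirroring the "division of labor" framing in the abstract; the paper's actual interactive module and multi-level answer predictor are more elaborate than the single gating and classification layers shown here.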
ISSN: 0141-9382, 1872-7387
DOI: 10.1016/j.displa.2022.102329