R-VQA: A robust visual question answering model

Bibliographic Details
Published in: Knowledge-Based Systems, 2025-01, Vol. 309, p. 112827, Article 112827
Authors: Chowdhury, Souvik; Soni, Badal
Format: Article
Language: English
Online access: Full text
Description
Abstract: Visual Question Answering (VQA) involves generating answers to questions about visual content, such as images. VQA models process an image and a question to produce an answer. One major challenge in this domain is robustness: current VQA models often operate within a fixed answer space and struggle with language prior (favoring frequent answers) and compositional reasoning (difficulty with complex object relationships). While existing research addresses these challenges separately, no prior work has tackled both language prior and compositional reasoning simultaneously. This paper presents three key contributions: a dataset specifically designed to address language prior and compositional reasoning issues, a unified model capable of addressing both problems in a single inference, and the ability to generate answers beyond a predefined answer space. The proposed model, R-VQA, demonstrates superior performance compared to state-of-the-art (SOTA) models across various VQA datasets.

Highlights:
• New VQA dataset: benchmarks language prior and compositional reasoning challenges.
• Enhanced QA cleaning: custom moderation ensures clean, relevant question-answer pairs.
• R-VQA model: tackles language prior and compositional reasoning in VQA tasks.
• Expanded answer space: overcomes response-space limitations in current VQA models.
• Improved text understanding: enhances answering questions about text in images.
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2024.112827