R-VQA: A robust visual question answering model
Published in: | Knowledge-based systems 2025-01, Vol.309, p.112827, Article 112827 |
---|---|
Main authors: | , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | Visual Question Answering (VQA) involves generating answers to questions about visual content, such as images. VQA models process an image and a question to produce an answer. One major challenge in this domain is robustness, as current VQA models often operate within a fixed answer space and struggle with issues related to language prior (favoring frequent answers) and compositional reasoning (difficulty with complex object relationships). While existing research addresses these challenges separately, no work has tackled both language prior and compositional reasoning simultaneously. This paper presents three key contributions: the development of a dataset specifically designed to address language prior and compositional reasoning issues, the creation of a unified model capable of addressing both problems in a single inference, and the ability to generate answers beyond a predefined answer space. Our proposed model, R-VQA, demonstrates superior performance compared to state-of-the-art (SOTA) models across various VQA datasets.
• New VQA Dataset: Benchmarks language prior and compositional reasoning challenges.
• Enhanced QA Cleaning: Custom moderation ensures clean, relevant question-answer pairs.
• 'R-VQA' Model: Tackles language prior and compositional reasoning in VQA tasks.
• Expanded Answer Space: Overcomes response space limitations in current VQA models.
• Improved Text Understanding: Enhances answering questions about text in images. |
---|---|
ISSN: | 0950-7051 |
DOI: | 10.1016/j.knosys.2024.112827 |