Question-conditioned debiasing with focal visual context fusion for visual question answering

Bibliographic Details
Published in: Knowledge-Based Systems, 2023-10, Vol. 278, p. 110879, Article 110879
Main Authors: Liu, Jin; Wang, GuoXiang; Fan, ChongFeng; Zhou, Fengyu; Xu, HuiJuan
Format: Article
Language: English
Online Access: Full Text
Description
Summary: Existing Visual Question Answering models suffer from language priors: their answers rely too heavily on correlations between questions and answers, ignoring the actual visual information, which leads to a significant performance drop on out-of-distribution datasets. To eliminate such language bias, prevalent approaches mainly weaken the language prior with an auxiliary question-only branch, and they target the statistical distribution prior of question type–answer pairs rather than that of question–answer pairs. Moreover, most models produce answers with improper visual grounding. This paper proposes a model-agnostic framework that addresses these drawbacks through question-conditioned debiasing with focal visual context fusion. First, instead of question type-conditioned correlations, we counter the language distribution shortcut at the level of question-conditioned correlations by removing the shortcut between each question and its most frequently occurring answer. Additionally, we use the deviation between the predicted answer distribution and the ground truth as a pseudo target, which prevents the model from falling into the distribution bias of other frequent answers. Further, we stress that the imbalance between the number of images and the number of questions places higher demands on a proper visual context. We improve correct visual utilization through contrastive sampling and design a focal visual context fusion module that incorporates the critical object word, extracted from the question via Part-Of-Speech tagging, into the visual features to augment salient visual information without human annotations. Extensive experiments on three public benchmark datasets, i.e., VQA v2, VQA-CP v2, and VQA-CP v1, demonstrate the effectiveness of our model.
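
The abstract mentions extracting a critical object word from the question with Part-Of-Speech tagging before fusing it into the visual features. The paper's own implementation is not reproduced here; the snippet below is only a minimal sketch of that extraction step, assuming spaCy as the tagger and a simple keep-the-nouns heuristic (extract_object_words is a hypothetical helper, not from the paper).

    # Minimal sketch (not the authors' code): extract candidate "critical object
    # words" from a VQA question via Part-Of-Speech tagging, as the abstract
    # describes. Assumes spaCy with the small English model installed:
    #   pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_object_words(question: str) -> list[str]:
        """Return noun tokens of the question as candidate critical object words."""
        doc = nlp(question)
        # Keep common and proper nouns; drop question words, verbs, determiners, etc.
        return [tok.text.lower() for tok in doc if tok.pos_ in {"NOUN", "PROPN"}]

    if __name__ == "__main__":
        # e.g. ['color', 'umbrella', 'bench'] -- the fusion module described in the
        # abstract would then emphasize image regions matching such words.
        print(extract_object_words("What color is the umbrella next to the bench?"))
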
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2023.110879