Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Saved in:

Main Authors: | , , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Abstract: | Audio-Visual Question Answering (AVQA) is a complex multi-modal reasoning
task that requires intelligent systems to respond accurately to natural language
queries based on paired audio-video inputs. Prevalent AVQA approaches, however,
are prone to overlearning dataset biases, resulting in poor robustness.
Furthermore, current datasets may not provide a precise diagnostic for these
methods. To tackle these challenges, we first propose a novel dataset,
MUSIC-AVQA-R, crafted in two steps: rephrasing the questions in the test split
of a public dataset (MUSIC-AVQA) and then introducing distribution shifts across
the split questions. The rephrasing yields a large, diverse test space, while
the distribution shifts enable a comprehensive robustness evaluation over rare,
frequent, and overall questions. Second, we propose a robust architecture that
employs a multifaceted cycle collaborative debiasing strategy to overcome bias
learning. Experimental results show that this architecture achieves
state-of-the-art performance on MUSIC-AVQA-R, notably a significant improvement
of 9.32%. Extensive ablation experiments on both datasets analyze the
effectiveness of each component of the debiasing strategy, and an evaluation of
existing multi-modal QA methods on our dataset highlights their limited
robustness. We also combine various baselines with the proposed strategy on
both datasets to verify its plug-and-play capability. Our dataset and code are
available at https://github.com/reml-group/MUSIC-AVQA-R. |
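The two-step construction described above implies a frequency-based partition of the rephrased questions. Below is a minimal sketch of how a rare/frequent split might be derived from answer distributions; the `qtype`/`answer` field names and the 0.2 threshold are assumptions for illustration, not the actual procedure used to build MUSIC-AVQA-R.

```python
from collections import Counter, defaultdict

def split_rare_frequent(questions, rare_threshold=0.2):
    """Label each question 'rare' or 'frequent' by how common its answer is
    among questions sharing the same template (hypothetical 'qtype' key)."""
    # Count answer occurrences per question type.
    counts = defaultdict(Counter)
    for q in questions:
        counts[q["qtype"]][q["answer"]] += 1

    labeled = []
    for q in questions:
        dist = counts[q["qtype"]]
        freq = dist[q["answer"]] / sum(dist.values())
        # Low-frequency answers form the "rare" split, the rest "frequent";
        # together they constitute the "overall" evaluation set.
        labeled.append({**q, "split": "rare" if freq < rare_threshold else "frequent"})
    return labeled
```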
DOI: | 10.48550/arxiv.2404.12020 |
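This record does not detail the multifaceted cycle collaborative debiasing strategy itself. As a generic stand-in, the sketch below shows a well-known ensemble-style debiasing pattern (a RUBi-style question-only bias branch), which illustrates how a plug-and-play debiasing term can be attached to an existing AVQA baseline; all class, module, and parameter names are hypothetical, and this is not the paper's strategy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DebiasedAVQA(nn.Module):
    """Wraps a hypothetical AVQA backbone with a question-only bias branch
    (RUBi-style); a generic stand-in, not the paper's debiasing strategy."""

    def __init__(self, backbone, q_dim, num_answers):
        super().__init__()
        self.backbone = backbone                        # any baseline producing answer logits
        self.bias_head = nn.Linear(q_dim, num_answers)  # captures language-prior bias

    def forward(self, audio, video, q_feat):
        logits = self.backbone(audio, video, q_feat)    # fused multi-modal logits
        bias_logits = self.bias_head(q_feat)            # question-only logits
        # Mask fused logits with the sigmoid of the bias logits so training
        # de-emphasizes examples answerable from the question alone.
        return logits * torch.sigmoid(bias_logits), bias_logits

def debiased_loss(masked_logits, bias_logits, targets):
    # Main loss on the bias-masked logits plus an auxiliary term that
    # trains the bias branch itself.
    return F.cross_entropy(masked_logits, targets) + F.cross_entropy(bias_logits, targets)
```

Because the wrapper only intercepts the baseline's output logits, any backbone with the same interface can be swapped in, mirroring the plug-and-play combination with baselines reported in the abstract.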