Dynamic Debiasing Network for Visual Commonsense Generation
Published in: IEEE Access, 2023, Vol. 11, pp. 139706-139714
Main Authors: , , ,
Format: Article
Language: English
Subjects:
Online Access: Full text
Abstract: The task of Visual Commonsense Generation (VCG) delves into the deeper narrative behind a static image, aiming to comprehend not just its immediate content but also the surrounding context. The VCG model generates three types of captions for each image: 1) the events preceding the image, 2) the characters' current intents, and 3) the anticipated subsequent events. However, a significant challenge in VCG research is the prevalent yet under-addressed issue of dataset bias, which can result in spurious correlations during model training. This occurs when a model, influenced by biased data, infers associations that frequently appear in the dataset but may not provide accurate or contextually appropriate interpretations. The issue becomes even more complex in multimodal tasks, where different types of data, such as text and image, bring their unique biases. When these modalities are combined as inputs to a model, one modality might exhibit a stronger bias than others. To address this, we introduce the Dynamic Debiasing Network (DDNet) for Visual Commonsense Generation. DDNet is designed to identify the biased modality and dynamically counteract modality-specific biases using causal relationships. By considering biases from multiple modalities, DDNet avoids over-focusing on any single modality and effectively combines information from all modalities. The experimental results on the VisualCOMET dataset demonstrate that our proposed network fosters more accurate commonsense inferences. This emphasizes the critical need for debiasing in multimodal tasks and enhances the reliability of machine-generated commonsense narratives.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3340705
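
A short, hedged sketch of the core idea described in the abstract. The abstract gives no implementation details, so everything below (the module name DynamicDebias, the unimodal bias heads, and the softmax gate) is an illustrative assumption in PyTorch, not the authors' published code. It shows one common causal, counterfactual-style debiasing pattern consistent with the abstract's description: estimate each modality's shortcut prediction with a unimodal head, dynamically weight how biased each modality is for the current input, and subtract the weighted unimodal logits from the fused prediction.

```python
import torch
import torch.nn as nn


class DynamicDebias(nn.Module):
    """Illustrative sketch (not the published DDNet implementation):
    a counterfactual-style correction that subtracts dynamically weighted
    modality-specific bias estimates from the fused prediction."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # Unimodal heads approximate the shortcut each modality alone
        # would take; their logits act as bias estimates.
        self.text_only_head = nn.Linear(hidden_dim, vocab_size)
        self.image_only_head = nn.Linear(hidden_dim, vocab_size)
        self.fused_head = nn.Linear(hidden_dim, vocab_size)
        # Gate deciding, per example, how strongly each modality's bias
        # should be counteracted (the "dynamic" part).
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, text_feat, image_feat, fused_feat):
        # Total effect: prediction from the fused multimodal features.
        logits_fused = self.fused_head(fused_feat)
        # Direct (biased) effects: predictions from each modality alone.
        logits_text = self.text_only_head(text_feat)
        logits_image = self.image_only_head(image_feat)
        # Per-example weights estimating which modality is more biased.
        w = self.gate(torch.cat([text_feat, image_feat], dim=-1))
        # Remove the weighted unimodal biases from the fused logits.
        return (logits_fused
                - w[..., 0:1] * logits_text
                - w[..., 1:2] * logits_image)


# Usage with dummy features (batch of 4, 512-dim features, 30k vocab):
model = DynamicDebias(hidden_dim=512, vocab_size=30000)
text_feat, image_feat, fused_feat = (torch.randn(4, 512) for _ in range(3))
debiased_logits = model(text_feat, image_feat, fused_feat)  # shape (4, 30000)
```

The subtraction mirrors counterfactual-inference debiasing used elsewhere in vision-and-language work: removing a modality's direct effect from the total effect leaves a less shortcut-driven prediction, and the learned gate keeps the correction input-dependent rather than fixed, matching the abstract's claim that the biased modality is identified dynamically.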