Closed-loop reasoning with graph-aware dense interaction for visual dialog


Bibliographic details
Published in:	Multimedia Systems, 2022, Vol. 28 (5), pp. 1823-1832
Main authors:	Liu, An-An; Zhang, Guokai; Xu, Ning; Guo, Junbo; Jin, Guoqing; Li, Xuanya
Format:	Article
Language:	English
Subjects:	Ablation; Computer Communication Networks; Computer Graphics; Computer Science; Cryptology; Data Storage Representation; Multimedia Information Systems; Operating Systems; Reasoning; Regular Paper; Vision
Online access:	Full text

Description
Summary:	Visual dialog is an attractive vision-language task in which the correct answer is predicted from a given question, dialog history, and image. Although researchers have offered diverse solutions for connecting text with vision, multi-modal information still interacts inadequately for semantic alignment. To solve this problem, we propose closed-loop reasoning with graph-aware dense interaction, which aims to discover cues through the dynamic structure of a graph and to leverage them to benefit dialog and image features. Moreover, we analyze the statistics of the linguistic entities hidden in the dialog to verify the reliability of the graph construction. Experiments on two VisDial datasets indicate that our model achieves competitive results against previous methods. An ablation study and a parameter analysis further demonstrate the effectiveness of our model.
ISSN:	0942-4962, 1432-1882
DOI:	10.1007/s00530-022-00947-1