A visual question answering model based on image captioning

Bibliographic Details
Published in: Multimedia Systems, 2024, Vol. 30 (6)
Main Authors: Zhou, Kun; Liu, Qiongjie; Zhao, Dexin
Format: Article
Language: English
Subjects:
Online Access: Full text
Description
Summary: Image captioning and visual question answering are two important tasks in artificial intelligence that are widely used in everyday applications. The two tasks share many similarities and rely on largely the same knowledge and techniques: both are cross-modal tasks involving computer vision and natural language processing, so they can be studied within a single model, with the image captioning results used to enhance the visual question answering output. However, current research on the two tasks has largely been conducted independently, and the accuracy of visual question answering still needs to be improved. This paper therefore proposes IC-VQA, a visual question answering model based on image captioning. The model first performs image captioning: it obtains rich visual information by constructing geometric relations between objects and exploiting mesh information, and then generates question-specific caption sentences through an Attention + Transformer framework. It then performs visual question answering: the previously generated caption sentences are fused through an Attention + LSTM framework to answer the question, which significantly improves the accuracy of the answers. Experiments on the VQA 1.0 and VQA 2.0 datasets yield overall accuracies of 70.1 and 70.85, respectively, significantly closing the gap with human performance. These results demonstrate the effectiveness of the IC-VQA model and show that fusing question-specific caption sentences genuinely improves the accuracy of visual question answering.
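
The abstract describes a two-stage pipeline: an Attention + Transformer stage that generates question-specific captions from visual features, followed by an Attention + LSTM stage that fuses those captions with the question to produce an answer. The sketch below is a minimal PyTorch illustration of that two-stage idea, not the authors' implementation; all layer sizes, the fusion scheme, the answer-classifier head, and names such as CaptionModule and AnswerModule are illustrative assumptions, and positional encodings and decoding masks are omitted for brevity.

# Minimal sketch (assumed, not the paper's code) of a caption-then-answer pipeline.
import torch
import torch.nn as nn


class CaptionModule(nn.Module):
    """Stage 1: attend over visual region/grid features and produce caption token states."""

    def __init__(self, vocab_size=10000, d_model=512, n_layers=3, n_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, caption_tokens, visual_feats):
        # caption_tokens: (B, T) token ids; visual_feats: (B, R, d_model)
        tgt = self.token_emb(caption_tokens)
        return self.decoder(tgt, visual_feats)           # (B, T, d_model)


class AnswerModule(nn.Module):
    """Stage 2: encode the question with an LSTM, attend over caption states,
    and classify into a fixed answer vocabulary."""

    def __init__(self, vocab_size=10000, d_model=512, n_answers=3000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.classifier = nn.Linear(2 * d_model, n_answers)

    def forward(self, question_tokens, caption_states):
        q_emb = self.word_emb(question_tokens)
        _, (h, _) = self.lstm(q_emb)                      # h: (1, B, d_model)
        q_vec = h[-1].unsqueeze(1)                        # (B, 1, d_model)
        fused, _ = self.attn(q_vec, caption_states, caption_states)
        joint = torch.cat([q_vec, fused], dim=-1).squeeze(1)
        return self.classifier(joint)                     # (B, n_answers)


if __name__ == "__main__":
    B, R, T, Q = 2, 36, 12, 10
    visual_feats = torch.randn(B, R, 512)                 # stand-in region/grid features
    captions = torch.randint(0, 10000, (B, T))
    questions = torch.randint(0, 10000, (B, Q))
    cap_states = CaptionModule()(captions, visual_feats)
    logits = AnswerModule()(questions, cap_states)
    print(logits.shape)                                   # torch.Size([2, 3000])
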
ISSN: 0942-4962, 1432-1882
DOI: 10.1007/s00530-024-01573-9