A framework for visual question answering with the integration of scene-text using PHOCs and fisher vectors
•Accuracy of our VQA model is improved by using both visual and textual features.•Our model generates multi-word answers by employing dynamic pointer network.•Text tokens are represented by PHOC and FV embeddings together with other features.•Our model outperforms the previous models on VQA 2.0, Tex...
Gespeichert in:
Veröffentlicht in: | Expert systems with applications 2022-03, Vol.190, p.116159, Article 116159 |
---|---|
Hauptverfasser: | , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | •Accuracy of our VQA model is improved by using both visual and textual features.•Our model generates multi-word answers by employing dynamic pointer network.•Text tokens are represented by PHOC and FV embeddings together with other features.•Our model outperforms the previous models on VQA 2.0, Text-VQA and ST-VQA datasets.
Text contained in an image gives useful information about that image. Consider a warning signboard with text “high voltage”; it indicates the hazard or risk involved in the image. Thus, this semantic textual information can be very useful for better understanding of images, which is not utilized by the existing visual question answering (VQA) models. However, the presence of this textual information in images can strongly guide the VQA task. This work deal with the task of visual question answering by exploiting these textual cues together with the visual content to boost the accuracy of VQA models. In the work, a novel VQA model is proposed based on the PHOC and fisher vector based representation. Based on the PHOCs of the scene-text, we have constructed a powerful descriptor by using a Fisher Vectors. Also, the proposed model uses transformer model together with dynamic pointer networks for answer decoding process. Thus, the proposed model uses a sequence of decoding steps for answer generation instead of just assuming answer prediction as a classification problem as considered by previous works. We have shown the qualitative and quantitative results on three popular datasets: VQA 2.0, TextVQA and ST-VQA. The results show the effectiveness of the proposed model over the existing models. |
---|---|
ISSN: | 0957-4174 1873-6793 |
DOI: | 10.1016/j.eswa.2021.116159 |