Improving visual question answering by combining scene-text information

Bibliographic Details
Published in: Multimedia Tools and Applications, 2022-04, Vol. 81 (9), p. 12177-12208
Main Authors: Sharma, Himanshu; Jalal, Anand Singh
Format: Article
Language: English
Online Access: Full text
Description
Abstract: The text present in natural scenes contains semantic information about its surrounding environment. For example, the majority of questions asked by blind people about the images around them require understanding the text in the image. However, most existing Visual Question Answering (VQA) models do not consider the text present in an image. In this paper, the proposed model fuses multiple inputs such as visual features, question features and OCR tokens. It also captures the relationship between OCR tokens and the objects in an image, which previous models fail to use. Compared to previous models on the TextVQA dataset, the proposed model uses a dynamic pointer-network-based decoder to predict multi-word answers (OCR tokens and words from a fixed vocabulary) instead of treating answering as a single-step classification task. OCR tokens are represented using location, appearance, PHOC and Fisher Vector features in addition to the FastText features used by previous models on TextVQA. A powerful descriptor is constructed by applying Fisher Vectors (FV) computed from the PHOCs of the text present in images. This FV-based feature representation is better than a representation based on word embeddings only, as used by previous state-of-the-art models. Quantitative and qualitative experiments performed on popular benchmarks including TextVQA, ST-VQA and VQA 2.0 reveal the efficacy of the proposed model. The proposed VQA model attains 41.23% accuracy on the TextVQA dataset, 40.98% on the ST-VQA dataset and 74.98% overall accuracy on the VQA 2.0 dataset. The results suggest that the gap between human accuracy and model accuracy is significantly larger on the TextVQA and ST-VQA datasets than on VQA 2.0, recommending the TextVQA and ST-VQA datasets for future research as a complement to VQA 2.0.
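
As an illustration of the enriched OCR-token representation described in the abstract (FastText, PHOC-based Fisher Vector, appearance and location features fused into a single embedding), the following is a minimal PyTorch sketch, not the authors' implementation; all feature dimensions, the module name OCRTokenEmbedding and the projection-plus-layer-norm fusion are assumptions for illustration only.

import torch
import torch.nn as nn

# Hypothetical feature dimensions; the abstract does not specify these values.
D_FASTTEXT = 300   # FastText word embedding of the OCR token
D_PHOC_FV  = 604   # Fisher Vector computed from the token's PHOC (assumed size)
D_APPEAR   = 2048  # appearance feature of the token's image region (assumed size)
D_LOC      = 4     # normalized bounding-box location [x1, y1, x2, y2]
D_MODEL    = 768   # joint embedding size for multimodal fusion (assumed)

class OCRTokenEmbedding(nn.Module):
    """Sketch of an enriched OCR-token representation: FastText, PHOC-based
    Fisher Vector, appearance and location features are concatenated and
    projected into one joint embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_FASTTEXT + D_PHOC_FV + D_APPEAR + D_LOC, D_MODEL)
        self.norm = nn.LayerNorm(D_MODEL)

    def forward(self, fasttext, phoc_fv, appearance, location):
        # Each input has shape (batch, num_ocr_tokens, feature_dim).
        fused = torch.cat([fasttext, phoc_fv, appearance, location], dim=-1)
        return self.norm(self.proj(fused))

The resulting token embeddings could then be fed, together with visual and question features, to a multimodal fusion module and a dynamic pointer-network decoder that selects each answer word from either the OCR tokens or a fixed vocabulary, as the abstract describes.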
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-022-12317-0