Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

Bibliographic Details
Published in: Journal of Big Data, 2023-12, Vol. 10 (1), Article 18
Authors: Sasibhooshan, Reshmi; Kumaraswamy, Suresh; Sasidharan, Santhoshkumar
Format: Article
Language: English
Online access: Full text
Description
Abstract: Automatic caption generation with attention mechanisms aims to generate more descriptive captions that capture coarse-to-fine semantic content in the image. In this work, we use an encoder-decoder framework employing a Wavelet-transform-based Convolutional Neural Network (WCNN) with two-level discrete wavelet decomposition to extract visual feature maps highlighting the spatial, spectral, and semantic details of the image. The Visual Attention Prediction Network (VAPN) computes both channel and spatial attention to obtain visually attentive features. In addition, local features are taken into account by considering the contextual spatial relationships between the different objects. The probability of predicting the appropriate word is obtained by combining the aforementioned architecture with a Long Short-Term Memory (LSTM) decoder network. Experiments conducted on three benchmark datasets (Flickr8K, Flickr30K, and MSCOCO) demonstrate the improved performance of the proposed model, with a CIDEr score of 124.2.
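
The abstract describes channel and spatial attention computed over encoder feature maps. As a rough illustration of that general idea only (not the authors' exact VAPN; the module sizes, layer choices, and fusion order below are assumptions), a minimal PyTorch sketch might look like:

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Squeeze-and-excitation style gate: reweights feature channels."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pool
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),  # per-channel weight in [0, 1]
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w  # reweight channels

    class SpatialAttention(nn.Module):
        """CBAM-style gate: reweights spatial positions of the feature map."""
        def __init__(self, kernel_size: int = 7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            avg = x.mean(dim=1, keepdim=True)  # average over channels
            mx, _ = x.max(dim=1, keepdim=True)  # max over channels
            mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * mask  # reweight spatial positions

    # Usage: channel attention followed by spatial attention on a feature map.
    # Here a random tensor stands in for the encoder output.
    feats = torch.randn(2, 256, 14, 14)
    attended = SpatialAttention()(ChannelAttention(256)(feats))
    print(attended.shape)  # torch.Size([2, 256, 14, 14])

In the paper's pipeline, the input feature maps would come from the WCNN encoder (e.g., subbands of the two-level discrete wavelet decomposition, which one could compute with a library such as PyWavelets), and the attended features would condition the LSTM decoder at each word-prediction step; the random tensor above is a placeholder.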
ISSN: 2196-1115
DOI: 10.1186/s40537-023-00693-9