Integration of textual cues for fine-grained image captioning using deep CNN and LSTM
Published in: Neural Computing & Applications, 2020-12, Vol. 32 (24), pp. 17899-17908
Format: Article
Language: English
Online access: Full text
Abstract: The automatic narration of a natural scene is an important capability in artificial intelligence that unites computer vision and natural language processing. Caption generation is a challenging task in scene understanding. Most state-of-the-art methods use deep convolutional neural network models to extract visual features of the entire image, and then exploit the parallel structure between images and sentences with recurrent neural networks to generate captions. However, such models exploit only visual features for caption generation. This work investigates whether fusing the text available in an image can yield more fine-grained captioning of a scene. In this paper, we propose a model that incorporates a deep convolutional neural network and long short-term memory to boost the accuracy of image captioning by fusing the text features available in an image with the visual features extracted by state-of-the-art methods. We validate the effectiveness of the proposed model on the benchmark Flickr8k and Flickr30k datasets. The experimental outcomes show that the proposed model outperforms state-of-the-art methods for image captioning.
ISSN: 0941-0643, 1433-3058
DOI: 10.1007/s00521-019-04515-z
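
The record contains no implementation details beyond the abstract. As an illustration of the kind of fusion the abstract describes, the following is a minimal PyTorch sketch, assuming a pooled CNN feature vector for the image, an embedded representation of text detected in the image (e.g. averaged OCR word embeddings) as the textual cue, additive fusion of the two projections, and an LSTM decoder. The class name TextualCueCaptioner, the feature dimensions, and the fusion scheme are illustrative assumptions, not the authors' published architecture.

# Sketch only: fuse CNN visual features with features of text detected in the
# image, then decode a caption with an LSTM (assumed architecture, not the paper's).
import torch
import torch.nn as nn


class TextualCueCaptioner(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, embed_dim=256, visual_dim=2048,
                 text_cue_dim=300, hidden_dim=512):
        super().__init__()
        # Project pooled CNN visual features (e.g. a 2048-d ResNet vector).
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Project the textual-cue features (e.g. averaged OCR word embeddings).
        self.text_cue_proj = nn.Linear(text_cue_dim, hidden_dim)
        # Word embeddings for the caption decoder.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # LSTM decoder conditioned on the fused image representation.
        self.lstm = nn.LSTM(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feats, text_cue_feats, captions):
        # Fuse the two modalities; a simple sum of projections is assumed here.
        fused = torch.tanh(self.visual_proj(visual_feats)
                           + self.text_cue_proj(text_cue_feats))    # (B, hidden_dim)
        # Condition every decoding step on the fused vector.
        word_embs = self.embed(captions)                             # (B, T, embed_dim)
        fused_seq = fused.unsqueeze(1).expand(-1, word_embs.size(1), -1)
        lstm_in = torch.cat([word_embs, fused_seq], dim=-1)          # (B, T, embed+hidden)
        hidden, _ = self.lstm(lstm_in)
        return self.out(hidden)                                      # (B, T, vocab_size)


# Toy usage with random tensors standing in for real CNN / OCR features.
model = TextualCueCaptioner(vocab_size=10000)
visual = torch.randn(4, 2048)                 # pooled CNN features for 4 images
text_cues = torch.randn(4, 300)               # embedded scene text for the same images
captions = torch.randint(0, 10000, (4, 12))   # token ids of ground-truth captions
logits = model(visual, text_cues, captions)
print(logits.shape)                           # torch.Size([4, 12, 10000])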