Enhancing image captioning performance based on EfficientNet-B0 model and transformer encoder-decoder



Bibliographic Details
Main Authors: Joshi, Abhisht; Alkhayyat, Ahmed; Gunwant, Harsh; Tripathi, Abhay; Sharma, Moolchand
Format: Conference Paper
Language: English
Subjects:
Online Access: Full text
Description
Abstract: In recent years, advances in natural language processing and computer vision have come together to enable automatic image caption generation. Image captioning is the process of creating a textual description for an image. Captioning an image requires recognizing the significant objects in it, their properties, and the relationships among them, and it must also produce sentences that are syntactically and semantically correct. Deep learning approaches can address the complexities and difficulties associated with image captioning. This paper describes a joint model that automatically captions images using EfficientNet-B0 and a transformer with multi-head attention. The model combines a single EfficientNet encoder with a transformer decoder. The encoder uses EfficientNet-B0, a convolutional neural network, to extract a detailed representation of the input image and embed it into a fixed-length vector. The decoder employs a transformer with a multi-head attention mechanism that selectively concentrates on particular regions of the image while predicting the sentence. The proposed model was trained on the Flickr8k dataset and evaluated with BLEU n-gram scores (n = 1, 2, 3, 4), METEOR, and CIDEr, which assess how well the generated captions match the target description phrases provided with the training images. Our experiments show that the proposed model can produce captions for images automatically.
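The architecture summarized in the abstract follows a fairly standard encoder-decoder layout. The sketch below, written for TensorFlow/Keras (2.10 or later for the causal-mask option), is one plausible way to wire it up; the layer sizes, vocabulary size, caption length, and helper names (build_encoder, build_decoder) are illustrative assumptions rather than the authors' actual configuration, and positional embeddings and residual connections are omitted for brevity.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10000   # assumed vocabulary size
SEQ_LEN = 25         # assumed maximum caption length
EMBED_DIM = 512      # assumed model / embedding dimension
NUM_HEADS = 8        # number of attention heads (assumed)

# Encoder: EfficientNet-B0 backbone turns a 224x224 image into a grid of
# feature vectors that serve as the fixed-length image representation.
def build_encoder():
    backbone = keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3)
    )
    backbone.trainable = False
    inputs = keras.Input(shape=(224, 224, 3))
    x = backbone(inputs)                        # (7, 7, 1280) feature map
    x = layers.Reshape((-1, x.shape[-1]))(x)    # flatten to 49 "image tokens"
    x = layers.Dense(EMBED_DIM, activation="relu")(x)
    return keras.Model(inputs, x, name="efficientnet_encoder")

# Decoder: masked self-attention over the partial caption, then cross-attention
# over the image tokens, as in a standard transformer decoder block.
# (Positional embeddings and residual connections are left out for brevity.)
def build_decoder():
    image_features = keras.Input(shape=(None, EMBED_DIM))
    token_ids = keras.Input(shape=(SEQ_LEN,), dtype="int32")

    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(token_ids)
    x = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)(
        x, x, use_causal_mask=True              # masked self-attention
    )
    x = layers.LayerNormalization()(x)
    x = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)(
        x, image_features                       # cross-attention on image tokens
    )
    x = layers.LayerNormalization()(x)
    x = layers.Dense(EMBED_DIM, activation="relu")(x)
    outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
    return keras.Model([image_features, token_ids], outputs,
                       name="transformer_decoder")

Training such a model would minimize cross-entropy on next-token prediction over Flickr8k captions; the BLEU-1 through BLEU-4, METEOR, and CIDEr figures mentioned in the abstract would then be computed on captions generated for held-out images, not by this sketch itself.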
ISSN: 0094-243X, 1551-7616
DOI: 10.1063/5.0184395