Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning
Saved in:
Main authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | In a globalized world at the present epoch of generative
intelligence, most manual labour tasks are automated with increased
efficiency. This can help businesses save time and money. A crucial
component of generative intelligence is the integration of vision and
language. Consequently, image captioning has become an intriguing area of
research. Researchers have made multiple attempts to solve this problem with
different deep learning architectures; although accuracy has increased, the
results are still not up to standard. This study undertakes a comparison of
a Transformer model and an LSTM-with-attention-block model on the MS-COCO
dataset, a standard dataset for image captioning. For both models, a
pretrained Inception-V3 CNN encoder is used to extract image features. The
Bilingual Evaluation Understudy (BLEU) score is used to check the accuracy
of the captions generated by both models. Alongside the Transformer and
LSTM-with-attention-block models, the CLIP-diffusion, M2-Transformer, and
X-Linear Attention models, which report state-of-the-art accuracy, are also
discussed. |
DOI: | 10.48550/arxiv.2303.02648 |
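A minimal sketch of the evaluation pipeline described in the abstract: image
features extracted with a pretrained Inception-V3 encoder, and a generated
caption scored against a reference with BLEU. This is not the paper's code; it
assumes TensorFlow/Keras and NLTK are available, the image path and both
captions are hypothetical placeholders, and the caption decoder itself is
omitted.

```python
# Minimal sketch (not the paper's code): Inception-V3 feature extraction
# and BLEU scoring, assuming TensorFlow/Keras and NLTK.
import numpy as np
import tensorflow as tf
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Pretrained Inception-V3 without its classification head; global average
# pooling yields a 2048-dim feature vector per image for a caption decoder.
encoder = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", pooling="avg")

def extract_features(image_path: str) -> np.ndarray:
    """Resize an image to Inception-V3's 299x299 input and encode it."""
    img = tf.keras.utils.load_img(image_path, target_size=(299, 299))
    x = tf.keras.applications.inception_v3.preprocess_input(
        tf.keras.utils.img_to_array(img))
    return encoder.predict(x[np.newaxis, ...], verbose=0)[0]  # shape: (2048,)

features = extract_features("example.jpg")  # hypothetical image path

# BLEU compares the generated caption against reference captions; smoothing
# avoids zero scores when a higher-order n-gram has no match.
reference = "a dog runs across the grass".split()    # illustrative reference
candidate = "a dog is running on the grass".split()  # illustrative output
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")
```

In the study's setting, the same extracted features would be fed to either the
Transformer or the LSTM-with-attention decoder, and BLEU computed over the
MS-COCO reference captions rather than a single made-up pair.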