Model Semantic Attention (SemAtt) With Hybrid Learning Separable Neural Network and Long Short-Term Memory to Generate Caption


Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, p. 154467-154481
Main authors: Nursikuwagus, Agus; Munir, Rinaldi; Khodra, Masayu L.; Dewi, Deshinta Arrova
Format: Article
Language: English
Description
Summary: Image captioning is an active research topic that combines computer vision and natural language processing. In geology, one task is to describe images of rocks: a geologist writes a textual description of an image's content so that it can be used later. One objective of this research is to interpret the objects in an image by traversing the image structures in depth, focusing on shapes, colors, and structures to obtain the image's features. The problem addressed is how a separable neural network (SNN) and long short-term memory (LSTM) affect whether the generated caption matches the geologist's description. In the image captioning architecture, the SNN is called Visual Attention (VaT) and the LSTM is called Semantic Attention (SemAtt). The experiments show that the captioning model achieves BLEU-1 = 0.908, BLEU-2 = 0.877, BLEU-3 = 0.750, and BLEU-4 = 0.510. On two further metrics, METEOR and ROUGE-L, it scores 0.670 and 0.623, respectively. The model outperforms the baseline model. Based on these evaluations, we conclude that the model can generate captions for geological rock images that match the geologist's description. Precision and recall support the model in predicting words that suit the image features.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3481499
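
The BLEU-1 to BLEU-4 scores quoted in the summary measure n-gram overlap between a generated caption and one or more reference captions. The record does not say which BLEU variant or toolkit the authors used, so the following is only a minimal sketch of the standard cumulative BLEU-n definition (clipped n-gram precision, uniform weights, brevity penalty, no smoothing). The function names and the two example captions are hypothetical and are not taken from the article or its dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    often as it occurs in the reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, references, max_n):
    """Cumulative BLEU-max_n with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    c = len(candidate)
    # Reference length closest to the candidate length (ties: the shorter one).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Hypothetical captions, for illustration only.
reference = "gray sandstone with fine grain and parallel lamination".split()
candidate = "gray sandstone with fine grain and cross lamination".split()

for n in range(1, 5):
    print(f"BLEU-{n} = {bleu(candidate, [reference], n):.3f}")
```

With these two captions, seven of the eight candidate words appear in the reference, so BLEU-1 is 7/8 = 0.875. The higher-order scores are lower because the single wrong word breaks several longer n-grams, which is the same pattern as the decline from BLEU-1 to BLEU-4 in the reported results.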