Knowledge graph construction in hyperbolic space for automatic image annotation

Bibliographic details
Published in:	Image and Vision Computing, 2024-11, Vol. 151, p. 105293, Article 105293
Main authors:	Lotfi, Fariba; Jamzad, Mansour; Beigy, Hamid; Farhood, Helia; Sheng, Quan Z.; Beheshti, Amin
Format:	Article
Language:	English
Subjects:	Attributed knowledge graph; Automatic image annotation; External knowledge sources; Hyperbolic space; Relational graph convolutional network; Vision transformer
Online access:	Full text

Description
Summary:	Automatic image annotation (AIA) is a fundamental and challenging task in computer vision. Considering the correlations between tags can lead to more accurate image understanding, benefiting various applications, including image retrieval and visual search. While many attempts have been made to incorporate tag correlations in annotation models, constructing a knowledge graph based on external knowledge sources and hyperbolic space has not been explored. In this paper, we create an attributed knowledge graph based on vocabulary, integrate external knowledge sources such as WordNet, and utilize hyperbolic word embeddings for the tag representations. These embeddings provide a sophisticated tag representation that captures hierarchical and complex correlations more effectively, enhancing the image annotation results. In addition, leveraging external knowledge sources enhances contextuality and significantly enriches existing AIA datasets. We exploit two deep learning-based models, the Relational Graph Convolutional Network (R-GCN) and the Vision Transformer (ViT), to extract the input features. We apply two R-GCN operations to obtain word descriptors and fuse them with the extracted visual features. We evaluate the proposed approach using three public benchmark datasets. Our experimental results demonstrate that the proposed architecture achieves state-of-the-art performance across most metrics on Corel5k, ESP Game, and IAPRTC-12.
• Knowledge graph with hyperbolic embeddings enhances automatic image annotation tags.
• External knowledge sources, like WordNet, enrich the annotation datasets.
• R-GCN and ViT models effectively fuse textual and visual features.
• Calibrated predictions refine class probabilities for more reliable annotations.
ISSN:	0262-8856
DOI:	10.1016/j.imavis.2024.105293
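
Illustrative note: the summary credits hyperbolic word embeddings with capturing hierarchical tag correlations. This record does not say which hyperbolic model the authors use, so the sketch below assumes the Poincaré ball model and shows only how a distance between two tag vectors would be computed there. The function name `poincare_distance` and the 2-D tag coordinates are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Assumption: the Poincare ball model; the record does not name the model used.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Hypothetical tag embeddings: a general tag near the origin,
# more specific tags closer to the boundary of the ball.
tags = {
    "animal": (0.0, 0.0),
    "dog": (0.6, 0.0),
    "cat": (0.0, 0.6),
}

d_parent_child = poincare_distance(tags["animal"], tags["dog"])
d_siblings = poincare_distance(tags["dog"], tags["cat"])
print(f"animal-dog: {d_parent_child:.3f}, dog-cat: {d_siblings:.3f}")
```

In this toy layout the two specific tags are farther from each other than either is from the general tag at the origin, which is the property that makes hyperbolic space a natural fit for tree-like vocabularies such as WordNet.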