Cross-modal Image Retrieval Considering Semantic Relationships with Many-to-many Correspondence Loss

A cross-modal image retrieval that explicitly considers semantic relationships between images and texts is proposed. Most conventional cross-modal image retrieval methods retrieve the target images by directly measuring the similarities between the candidate images and query texts in a common semant...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE access 2023-01, Vol.11, p.1-1
Hauptverfasser:	Zhang, Huaying, Yanagi, Rintaro, Togo, Ren, Ogawa, Takahiro, Haseyama, Miki
Format:	Artikel
Sprache:	eng
Schlagworte:	Convolutional neural networks Cross-modal image retrieval Embedding Extraterrestrial measurements Image retrieval many-to-many correspondences Measurement multimedia information retrieval Recurrent neural networks semantic similarity Semantics Similarity Texts Training data
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	A cross-modal image retrieval that explicitly considers semantic relationships between images and texts is proposed. Most conventional cross-modal image retrieval methods retrieve the target images by directly measuring the similarities between the candidate images and query texts in a common semantic embedding space. However, such methods tend to focus on a one-to-one correspondence between a predefined image-text pair during the training phase, and other semantically similar images and texts are ignored. By considering the many-to-many correspondences between semantically similar images and texts, a common embedding space is constructed to assure semantic relationships, which allows users to accurately find more images that are related to the input query texts. Thus, in this paper, we propose a cross-modal image retrieval method that considers semantic relationships between images and texts. The proposed method calculates the similarities between texts as semantic similarities to acquire the relationships. Then, we introduce a loss function that explicitly constructs the many-to-many correspondences between semantically similar images and texts from their semantic relationships. We also propose an evaluation metric to assess whether each method can construct an embedding space considering the semantic relationships. Experimental results demonstrate that the proposed method outperforms conventional methods in terms of this newly proposed metric.
ISSN:	2169-3536 2169-3536
DOI:	10.1109/ACCESS.2023.3239858