Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts

Bibliographic details
Published in:	International Journal of Multimedia Information Retrieval, 2024-03, Vol. 13 (1), p. 14, Article 14
Main authors:	Zhang, Huaying; Yanagi, Rintaro; Togo, Ren; Ogawa, Takahiro; Haseyama, Miki
Format:	Article
Language:	English
Subjects:	Accuracy; Computer Science; Data Mining and Knowledge Discovery; Database Management; Datasets; Embedding; Image Processing and Computer Vision; Image retrieval; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Mathematical models; Methods; Multimedia Information Systems; Neural networks; Parameters; Regular Paper; Retrieval; Semantics; Texts; Training
Online access:	Full text

Description
Abstract:	This study proposes a novel cross-modal image retrieval method based on parameter-efficient tuning of a pre-trained cross-modal model. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close together in a common embedding space. However, these methods are trained on huge amounts of intentionally annotated image-text pairs, which may be unavailable for specific databases. Fine-tuning a pre-trained model is one way to reduce this dependency on the amount and quality of training data and to improve retrieval accuracy on specific personal image databases. However, this approach is parameter inefficient, because a separate model must be trained and retained for each database. We therefore propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional vectors. The textual and visual prompts are concatenated with the input texts and images, respectively. By optimizing the prompts to bring paired texts and images close together in the common embedding space, the proposed method improves retrieval accuracy while updating only a few parameters. The experimental results demonstrate that the proposed method is effective for improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
ISSN:	2192-6611 2192-662X
DOI:	10.1007/s13735-024-00322-y
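
Illustrative sketch (not part of the catalog record or the article): the abstract describes trainable textual and visual prompt vectors that are concatenated with the inputs and optimized so that paired texts and images move close together in the common embedding space. The Python sketch below shows one way that forward pass could look. Everything in it is an assumption made for illustration: the names `make_prompt`, `with_prompt`, `frozen_encoder` and `contrastive_loss`, the sizes `DIM` and `PROMPT_LEN`, concatenation along the sequence axis, the mean-pooling stand-in for a frozen pre-trained encoder, and the symmetric contrastive loss. The abstract specifies none of these details.

```python
import math
import random

DIM = 8          # embedding width (hypothetical value)
PROMPT_LEN = 2   # prompt vectors per modality (hypothetical value)


def make_prompt(length, dim, rng):
    """Trainable prompt: `length` vectors of size `dim`, small random init."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(length)]


def with_prompt(prompt, sequence):
    """Concatenate the prompt vectors in front of the input sequence (assumed layout)."""
    return prompt + sequence


def frozen_encoder(sequence):
    """Stand-in for a frozen pre-trained encoder: mean-pool, then L2-normalize.

    A real encoder would be a pre-trained network whose weights stay fixed.
    """
    dim = len(sequence[0])
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Symmetric contrastive loss (assumed objective); matched pairs lie on the diagonal."""
    n = len(text_embs)
    logits = [[dot(t, v) / temperature for v in image_embs] for t in text_embs]

    def mean_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the maximum for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))


rng = random.Random(0)
text_prompt = make_prompt(PROMPT_LEN, DIM, rng)
visual_prompt = make_prompt(PROMPT_LEN, DIM, rng)

# Toy batch of 3 text-image pairs: 5 token vectors and 4 patch vectors per item.
texts = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)] for _ in range(3)]
images = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(4)] for _ in range(3)]

text_embs = [frozen_encoder(with_prompt(text_prompt, t)) for t in texts]
image_embs = [frozen_encoder(with_prompt(visual_prompt, v)) for v in images]
loss = contrastive_loss(text_embs, image_embs)

# Only the two prompts would receive gradient updates in this sketch.
trainable = 2 * PROMPT_LEN * DIM
print("loss:", loss)
print("trainable prompt parameters:", trainable)
```

In this sketch the only values that would be updated are the `2 * PROMPT_LEN * DIM` prompt entries, which is the sense in which the approach is parameter efficient: one frozen model can be shared across databases while a small pair of prompts is stored per database. An automatic-differentiation framework would normally supply the gradients; that step is omitted here.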