BCRA: bidirectional cross-modal implicit relation reasoning and aligning for text-to-image person retrieval

Bibliographic Details
Published in: Multimedia Systems, 2024-08, Vol. 30 (4), Article 177
Main authors: Li, Zhaoqi; Xie, Yongping
Format: Article
Language: English
Online access: Full text
Description
Abstract: Text-to-image person retrieval aims to retrieve relevant target individuals based on given textual descriptions. The main challenge of this task is how to better combine and align the features of the text and image modalities. Previous efforts have introduced the masked language model (MLM) to implicitly strengthen multimodal representations, with some progress. However, the masked image model (MIM) remains underexplored in this task. We therefore propose BCRA, a bidirectional cross-modal implicit relation reasoning and aligning framework that introduces MIM as a supplement to the MLM task. First, we integrate the MIM and MLM tasks. Building on this foundation, to enhance multimodal interaction, we further investigate the impact of global and local visual features on the MLM task and construct a new cross-attention module. Additionally, we observe that image masks and language masks themselves serve as a powerful means of data augmentation: we directly reuse the masked data produced by these modules during training, engaging in cross-modal multi-view learning. Combining the bidirectional mask strategy with the other modules improves the accuracy and robustness of the model. The proposed approach achieves state-of-the-art results on all three public datasets and, compared with existing methods, offers higher speed, fewer parameters, and no dependence on additional datasets.
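
The bidirectional masking idea described in the abstract, where masked text tokens attend to image features (MLM) and masked image patches attend to text features (MIM) through cross-attention, can be sketched as follows. This is a minimal illustration under assumed shapes, mask ratios, and module names (CrossAttentionBlock, random_mask are hypothetical), not the authors' actual implementation.

import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Queries from one modality attend to keys/values from the other."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, context):
        out, _ = self.attn(queries, context, context)
        return self.norm(queries + out)  # residual connection + norm


def random_mask(tokens, mask_token, ratio):
    """Replace a random subset of token embeddings with a mask token."""
    b, n, _ = tokens.shape
    mask = torch.rand(b, n, device=tokens.device) < ratio
    masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
    return masked, mask


# Illustrative shapes: 32 text tokens and 49 image patches, already embedded.
dim = 256
text = torch.randn(2, 32, dim)
image = torch.randn(2, 49, dim)
mask_token = torch.zeros(1, 1, dim)  # a learned parameter in practice

masked_text, text_mask = random_mask(text, mask_token, ratio=0.15)
masked_image, image_mask = random_mask(image, mask_token, ratio=0.50)

# MLM branch: masked text queries attend to the intact visual features.
# MIM branch: masked image queries attend to the intact text features.
text_recon = CrossAttentionBlock(dim)(masked_text, image)
image_recon = CrossAttentionBlock(dim)(masked_image, text)
print(text_recon.shape, image_recon.shape)

In a full framework of this kind, the reconstruction losses would presumably be computed only at the masked positions (via text_mask and image_mask), and the masked views could additionally be reused as augmented inputs for the alignment objective, in line with the multi-view learning described in the abstract.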
ISSN: 0942-4962
eISSN: 1432-1882
DOI: 10.1007/s00530-024-01372-2