BCRA: bidirectional cross-modal implicit relation reasoning and aligning for text-to-image person retrieval
Saved in:
Published in: Multimedia systems 2024-08, Vol.30 (4), Article 177
Main authors: ,
Format: Article
Language: eng
Online access: Full text
Summary: Text-to-image person retrieval aims to retrieve relevant target individuals based on given textual descriptions. The main challenge in this task is how to better combine and align the features of the text and image modalities. Previous efforts have introduced masked language modeling (MLM) to implicitly enhance the capability of multimodal representation, making some progress. However, masked image modeling (MIM) appears to be underestimated in this task. We therefore propose BCRA, a bidirectional cross-modal implicit relation reasoning and aligning framework that introduces MIM as a complement to the MLM task. First, we integrate the MIM and MLM tasks. Building on this foundation, to enhance multimodal interaction, we further investigate the impact of global and local visual features on the MLM task and construct a new cross-attention module. Additionally, we observe that image masks and language masks themselves serve as a powerful means of data augmentation: we directly employ the masked data from the other modules during model training, engaging in cross-modal multi-view learning. Introducing this bidirectional mask strategy in conjunction with the other modules improves the accuracy and robustness of the model. The proposed approach achieves state-of-the-art results on three public datasets and, compared to existing methods, offers faster speed, fewer parameters, and no dependence on additional datasets.
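Since the record carries only the abstract, the following PyTorch sketch is a rough illustration of the described idea, not the authors' implementation: masked text tokens attend to image features for the MLM branch, masked image patches attend to text features for the MIM branch, and the two losses are summed. The class name, dimensions, vocabulary sizes (a discrete patch vocabulary in the style of BEiT-like tokenizers is assumed for MIM), and the equal loss weighting are all assumptions.

```python
# Minimal sketch only: every module name, dimension, vocabulary size,
# and the equal loss weighting below are illustrative assumptions.
import torch
import torch.nn as nn

class BidirectionalMaskedModeling(nn.Module):
    def __init__(self, dim=512, text_vocab=30522, patch_vocab=8192, heads=8):
        super().__init__()
        # Cross-attention in both directions: masked tokens of one
        # modality query the features of the other modality.
        self.txt2img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlm_head = nn.Linear(dim, text_vocab)   # predict masked word ids
        self.mim_head = nn.Linear(dim, patch_vocab)  # predict masked patch ids
        # ignore_index=-100 marks positions that were not masked
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, text_feats, image_feats, mlm_labels, mim_labels):
        # MLM branch: masked text attends over visual features (the paper
        # studies global vs. local visual features here; this sketch simply
        # uses the full patch sequence).
        t, _ = self.txt2img_attn(text_feats, image_feats, image_feats)
        # MIM branch: masked image patches attend over textual features.
        v, _ = self.img2txt_attn(image_feats, text_feats, text_feats)
        loss_mlm = self.ce(self.mlm_head(t).flatten(0, 1), mlm_labels.flatten())
        loss_mim = self.ce(self.mim_head(v).flatten(0, 1), mim_labels.flatten())
        return loss_mlm + loss_mim  # equal weighting is an assumption

# Toy usage: batch of 2, 16 text tokens, 49 image patches, dim 512.
model = BidirectionalMaskedModeling()
text = torch.randn(2, 16, 512)
image = torch.randn(2, 49, 512)
mlm_labels = torch.full((2, 16), -100)
mlm_labels[:, 3] = 42    # pretend token 3 was masked in each caption
mim_labels = torch.full((2, 49), -100)
mim_labels[:, 10] = 7    # pretend patch 10 was masked in each image
loss = model(text, image, mlm_labels, mim_labels)
```

Per the abstract, the actual framework additionally reuses the masked inputs as augmented views for cross-modal multi-view learning, which this sketch omits.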
ISSN: 0942-4962, 1432-1882
DOI: 10.1007/s00530-024-01372-2