Negative-Sensitive Framework With Semantic Enhancement for Composed Image Retrieval

Composed image retrieval is a challenging task in the field of multi-modal learning, aiming at measuring the similarities between target images and query images with modification sentences. Most previous methods either construct feature composition for the query image and modification text or concen...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	IEEE transactions on multimedia 2024, Vol.26, p.7608-7621
Hauptverfasser:	Wang, Yifan, Liu, Liyuan, Yuan, Chun, Li, Minbo, Liu, Jing
Format:	Artikel
Sprache:	eng
Schlagworte:	Adaptive sampling attention mechanism Composed image retrieval Correlation cross-modal retrieval distribution learning Filtration Image enhancement Image retrieval Learning Optimization Queries Retrieval semantic learning Semantics Similarity Task analysis Visualization
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Composed image retrieval is a challenging task in the field of multi-modal learning, aiming at measuring the similarities between target images and query images with modification sentences. Most previous methods either construct feature composition for the query image and modification text or concentrate on extracting cross-modal alignments. However, these methods are prone to neglect the negative impacts of the mismatched correspondences between the hybrid-modal query and target, which could be discriminative when comparing similar instances. Besides, localized textual representations are not fully explored when learning similarities between the query and the target. To overcome the above issues, we propose a Negative-Sensitive Framework with Semantic Enhancement (NSFSE) for mining the adaptive boundaries between matched and mismatched samples with comprehensive consideration of positive and negative correspondences. It can optimize the threshold dynamically based on distributions to explore the intrinsic characteristics of positive and negative correlations, which could further facilitate accurate similarity learning. A text-guided attention mechanism after infusing cross-modal affinities on localized word features is exploited in NSFSE to explore latent semantic-related visual similarity and cross-modal similarity simultaneously. The performance of extensive experiments and comprehensive analysis on three representative datasets CIRR, FashionIQ, and Fashion200 K demonstrate the effectiveness of negative mining of similarity with semantic enhancement in the proposed NSFSE.
ISSN:	1520-9210 1941-0077
DOI:	10.1109/TMM.2024.3369898