Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Bibliographic Details
Published in: IEEE Transactions on Multimedia 2023, Vol. 25, p. 543-556
Main Authors: Li, Wenhui, Wang, Yan, Su, Yuting, Li, Xuanya, Liu, An-An, Zhang, Yongdong
Format: Article
Language: English
Description
Abstract: Image and sentence matching is a critical task for bridging the visual-textual discrepancy caused by the heterogeneous modalities. Great progress has been made by exploring coarse-grained relationships between images and sentences or fine-grained relationships between regions and words. However, how to fully excavate and exploit the corresponding relations between these two modalities remains challenging. In this work, we propose a novel Multi-scale Fine-grained Alignments Network (MFA), which effectively explores multi-scale visual-textual correspondences to help bridge the cross-modal discrepancy. Specifically, a word-scale matching module is first utilized to mine the basic but fundamental correspondences between a single word and an independent region. Then, we propose a phrase-scale matching module to explore the relations between objects under the constraint of attributes and corresponding regions, which further preserves the associated information. To cope with the complex interactions among multiple phrases and images, we design a relation-scale matching module to capture high-order semantics between the two modalities. Moreover, each matching module includes visual and textual aggregations, which ensure the bi-directional coupling of multi-scale semantics. Extensive qualitative and quantitative experiments on two challenging datasets, Flickr30K and MSCOCO, show that the proposed method achieves superior performance compared with existing methods.
ISSN: 1520-9210; 1941-0077
DOI: 10.1109/TMM.2021.3128744
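
Note: this record contains only the abstract, not the model equations, so the exact MFA formulation is not reproduced here. As a rough, minimal sketch of what the word-scale (word-region) alignment described in the abstract typically looks like, the Python example below computes cosine similarities between image region features and sentence word features and aggregates them into a single image-sentence score. All function names, feature shapes, and the max-then-mean aggregation are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def word_region_similarity(word_feats, region_feats):
    """Cosine-similarity alignment between every word and every region.

    word_feats:   (num_words, d)   word embeddings of the sentence
    region_feats: (num_regions, d) detected image region features
    Returns a (num_words, num_regions) similarity matrix.
    """
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    return w @ r.T

def word_scale_score(word_feats, region_feats):
    """Aggregate word-region similarities into one image-sentence score.

    For each word, keep its best-matching region (max over regions),
    then average over words -- a common word-scale alignment scheme,
    used here only as a stand-in for the paper's matching module.
    """
    sim = word_region_similarity(word_feats, region_feats)
    return sim.max(axis=1).mean()

# Toy usage with random features (d = 8): 5 words, 10 regions.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))
regions = rng.normal(size=(10, 8))
print(word_scale_score(words, regions))
```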