Multi-Scale Fine-Grained Alignments for Image and Sentence Matching
Published in: | IEEE Transactions on Multimedia 2023, Vol. 25, pp. 543-556 |
---|---|
Main authors: | , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Image and sentence matching is a critical task for bridging the discrepancy between the heterogeneous visual and textual modalities. Great progress has been made by exploring coarse-grained relationships between images and sentences or fine-grained relationships between regions and words. However, how to fully uncover and exploit the correspondences between these two modalities remains challenging. In this work, we propose a novel Multi-scale Fine-grained Alignments Network (MFA), which effectively explores multi-scale visual-textual correspondences to help bridge the cross-modal discrepancy. Specifically, a word-scale matching module is first used to mine the basic but fundamental correspondences between a single word and an individual region. Then, we propose a phrase-scale matching module to explore the relations between objects, constrained by their attributes, and the corresponding regions, which preserves more associated information. To cope with the complex interactions among multiple phrases and images, we design a relation-scale matching module to capture high-order semantics between the two modalities. Moreover, each matching module includes both visual and textual aggregation, which ensures the bi-directional coupling of multi-scale semantics. Extensive qualitative and quantitative experiments on two challenging datasets, Flickr30K and MSCOCO, show that the proposed method achieves superior performance compared with existing methods. |
---|---|
ISSN: | 1520-9210, 1941-0077 |
DOI: | 10.1109/TMM.2021.3128744 |