Multi-Modal Memory Enhancement Attention Network for Image-Text Matching

Bibliographic Details
Published in: IEEE Access, 2020, Vol. 8, pp. 38438-38447
Main authors: Ji, Zhong; Lin, Zhigang; Wang, Haoran; He, Yuqing
Format: Article
Language: English
Online access: Full text
Description
Abstract: Image-text matching is an attractive research topic in the vision-and-language community. The key to narrowing the "heterogeneity gap" between visual and textual data lies in learning powerful and robust representations for both modalities. This paper proposes to address this issue and achieve fine-grained visual-textual alignment from two aspects: exploiting an attention mechanism to locate the semantically meaningful portions, and leveraging a memory network to capture long-term contextual knowledge. Unlike most existing studies, which solely explore cross-modal associations at the fragment level, our Collaborative Dual Attention (CDA) module models semantic interdependencies from both the fragment and the channel perspectives. Furthermore, since long-term contextual knowledge helps compensate for detailed semantics concealed in rarely appearing image-text pairs, we propose to learn joint representations with a Multi-Modal Memory Enhancement (M3E) module. Specifically, it sequentially stores intra-modal and multi-modal information in memory items, which in turn persistently memorize cross-modal shared semantics to improve the latent embeddings. By incorporating both the CDA and M3E modules into a deep architecture, our approach generates more semantically consistent embeddings for representing images and texts. Extensive experiments demonstrate that our model achieves state-of-the-art results on two public benchmark datasets.
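The abstract describes two architectural ideas: attention applied along both the fragment and the channel dimensions (CDA), and a memory bank that retains cross-modal shared semantics (M3E). The full method is in the article behind the DOI below; as a rough illustration only, the following PyTorch sketch shows what fragment-plus-channel attention and a learnable memory read could look like. All class names, shapes, and hyperparameters here are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of fragment/channel dual attention and a memory-bank
# refinement step. This is NOT the paper's released code; shapes, names, and
# the memory size are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttention(nn.Module):
    """Re-weights a set of fragment features along two axes:
    fragment-wise (which regions/words matter) and channel-wise
    (which feature dimensions matter)."""

    def __init__(self, dim: int):
        super().__init__()
        self.frag_query = nn.Linear(dim, dim)
        self.frag_key = nn.Linear(dim, dim)
        self.chan_gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_fragments, dim) -- region features for an image
        # or word features for a sentence.
        q, k = self.frag_query(x), self.frag_key(x)
        # Fragment-level self-attention: interdependencies between fragments.
        attn = F.softmax(q @ k.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        frag_out = attn @ x
        # Channel-level gating: pool over fragments, then re-weight channels.
        chan_w = torch.sigmoid(self.chan_gate(x.mean(dim=1, keepdim=True)))
        return frag_out * chan_w + x  # residual connection


class MemoryEnhancement(nn.Module):
    """Toy memory module: a bank of learnable items is read by attention
    and used to refine the pooled embedding."""

    def __init__(self, dim: int, n_items: int = 64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_items, dim) * 0.02)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, dim) pooled image or sentence embedding.
        scores = F.softmax(emb @ self.memory.t(), dim=-1)  # read weights
        read = scores @ self.memory                        # retrieved context
        return F.normalize(emb + read, dim=-1)


if __name__ == "__main__":
    regions = torch.randn(2, 36, 1024)   # e.g. 36 region features per image
    refined = DualAttention(1024)(regions)
    pooled = refined.mean(dim=1)
    print(MemoryEnhancement(1024)(pooled).shape)  # torch.Size([2, 1024])
```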
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.2975594