Learning Multimodal Representations by Symmetrically Transferring Local Structures

Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building one common space by aligning different modalities and ignore the complementary...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Symmetry (Basel) 2020-09, Vol.12 (9), p.1504
Hauptverfasser:	Dong, Bin, Jian, Songlei, Lu, Kai
Format:	Artikel
Sprache:	eng
Schlagworte:	Alignment Classification Clustering Cognitive tasks Dictionaries Image management Image retrieval Learning local structure Multilayers multimodal representations Neural networks Process parameters Representations Semantics soft metric learning Teaching methods
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building one common space by aligning different modalities and ignore the complementary information across the modalities, such as the intra-modal local structures. In other words, they only focus on the object-level alignment and ignore structure-level alignment. To tackle the problem, we propose a novel symmetric multimodal representation learning framework by transferring local structures across different modalities, namely MTLS. A customized soft metric learning strategy and an iterative parameter learning process are designed to symmetrically transfer local structures and enhance the cluster structures in intra-modal representations. The bidirectional retrieval loss based on multi-layer neural networks is utilized to align two modalities. MTLS is instantiated with image and text data and shows its superior performance on image-text retrieval and image clustering. MTLS outperforms the state-of-the-art multimodal learning methods by up to 32% in terms of R@1 on text-image retrieval and 16.4% in terms of AMI onclustering.
ISSN:	2073-8994 2073-8994
DOI:	10.3390/sym12091504