Collaborative fine-grained interaction learning for image–text sentiment analysis



Bibliographic Details
Published in: Knowledge-Based Systems, 2023-11, Vol. 279, Article 110951
Authors: Xiao, Xingwang; Pu, Yuanyuan; Zhou, Dongming; Cao, Jinde; Gu, Jinjing; Zhao, Zhengpeng; Xu, Dan
Format: Article
Language: English
Online access: Full text
Description
Abstract: Investigating interactions between image and text can effectively improve image–text sentiment analysis, but most existing methods do not explore image–text interaction at a fine-grained level. In this paper, we propose a Memory-enhanced Collaborative Fine-grained Interaction Transformer (MCFIT) to learn collaborative fine-grained interaction between image and text. Specifically, a multi-branch encoder is designed to learn both fine-grained region-word and patch-word interactions. Meanwhile, Memory-enhanced Cross-Attention (MECA) is proposed to utilize patch and region information to improve region-word interaction and patch-word interaction learning, respectively. Therefore, collaborative fine-grained interaction can yield more accurate image–text interaction. Finally, to analyze the sentiments embedded in real-life Chinese image–text pairs, we build a large-scale Chinese image–text sentiment dataset (CISD) containing 54,931 image–text pairs. Extensive experiments conducted on four real-life datasets prove the effectiveness of collaborative fine-grained interaction and demonstrate that MCFIT outperforms the state-of-the-art baselines.

Highlights:
• A large-scale Chinese image–text dataset including 54,931 Chinese image–text pairs is reported.
• Collaborative fine-grained interaction between image and text is proposed.
• Memory-enhanced Cross-Attention is designed to achieve collaborative fine-grained interaction.
• Experiments conducted on four real-life image–text datasets prove the validity of the proposed method.
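
The abstract does not spell out how Memory-enhanced Cross-Attention (MECA) is implemented, so the short PyTorch sketch below only illustrates the general pattern it names: cross-attention in which the key/value sequence of visual features is augmented with a learnable memory bank before word tokens attend to it. The class name, dimensions, and memory size here are assumptions for illustration, not the authors' code.

# Illustrative sketch only, not the MCFIT/MECA implementation from the paper.
# Assumed design: word tokens query region (or patch) features whose key/value
# sequence is extended with a small learnable memory bank.
import torch
import torch.nn as nn

class MemoryAugmentedCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_memory_slots: int = 16):
        super().__init__()
        # Learnable memory slots appended to the keys/values (assumed hyperparameters).
        self.memory = nn.Parameter(torch.randn(num_memory_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, words: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # words:  (B, Lw, D) text-token embeddings used as queries
        # visual: (B, Lv, D) region or patch features used as keys/values
        mem = self.memory.unsqueeze(0).expand(visual.size(0), -1, -1)
        kv = torch.cat([visual, mem], dim=1)      # (B, Lv + M, D)
        out, _ = self.attn(words, kv, kv)         # text attends to visual features plus memory
        return self.norm(words + out)             # residual connection + layer norm

if __name__ == "__main__":
    words = torch.randn(2, 20, 256)    # toy word embeddings
    regions = torch.randn(2, 36, 256)  # toy region features
    print(MemoryAugmentedCrossAttention()(words, regions).shape)  # torch.Size([2, 20, 256])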
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2023.110951