Collaborative fine-grained interaction learning for image–text sentiment analysis
Published in: | Knowledge-based systems 2023-11, Vol.279, p.110951, Article 110951 |
---|---|
Main authors: | , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Investigating interactions between image and text can effectively improve image–text sentiment analysis, but most existing methods do not explore image–text interaction at a fine-grained level. In this paper, we propose a Memory-enhanced Collaborative Fine-grained Interaction Transformer (MCFIT) to learn collaborative fine-grained interaction between image and text. Specifically, a multi-branch encoder is designed to learn both fine-grained region-word and patch-word interactions. Meanwhile, Memory-enhanced Cross-Attention (MECA) is proposed to utilize patch and region information to improve region-word interaction and patch-word interaction learning, respectively. Therefore, collaborative fine-grained interaction can yield more accurate image–text interaction. Finally, to analyze the sentiments embedded in real-life Chinese image–text pairs, we build a large-scale Chinese image–text sentiment dataset (CISD) containing 54,931 image–text pairs. Extensive experiments conducted on four real-life datasets prove the effectiveness of collaborative fine-grained interaction and demonstrate that MCFIT outperforms the state-of-the-art baselines.
• A large-scale Chinese image–text dataset including 54,931 Chinese image–text pairs is reported.
• Collaborative fine-grained interaction between image and text is proposed.
• Memory-enhanced Cross-Attention is designed to achieve collaborative fine-grained interaction.
• Experiments conducted on four real-life image–text datasets prove the validity of the proposed method. |
---|---|
ISSN: | 0950-7051 1872-7409 |
DOI: | 10.1016/j.knosys.2023.110951 |