A Rearrangement and Restore-Based Mixer Model for Target-Oriented Multimodal Sentiment Classification
Published in: IEEE Transactions on Artificial Intelligence, June 2024, Vol. 5, No. 6, pp. 3109-3119
Format: Article
Language: English
Abstract: With the development of fine-grained multimodal sentiment analysis tasks, target-oriented multimodal sentiment classification (TMSC) has received increasing attention; it aims to classify the sentiment of a given target with the help of textual features and features from an associated image. Existing methods focus on exploring fine-grained image features and incorporate complex transformer-based fusion strategies, while ignoring the heavy computational burden. Recently, some lightweight multilayer perceptron (MLP)-based methods have been successfully applied to multimodal sentiment classification tasks. In this article, we propose an effective rearrangement and restore mixer model (RR-Mixer) for TMSC, which models the interaction of image, text, and targets along the modal axis, sequential axis, and feature channel axis through rearrangement and restore operations. Specifically, we take vision transformer (ViT) and robustly optimized BERT (RoBERTa) pretrained models to extract image and textual features, respectively. Further, we adopt cosine similarity to select the most semantically relevant image features. Then, an RR-Mixer module is designed to mix the multimodal features, with its core consisting of rolling, grouping rearrangement, and restore operations. Moreover, we introduce an MLP Unit to learn the information of different modalities for intermodal interaction. The results show that our model achieves superior performance on two benchmark multimodal datasets, TWITTER-15 and TWITTER-17, with significant improvements of 4.66% and 1.26% in macro-F1, respectively. (An illustrative code sketch of the roll-and-restore mixing idea follows the record fields below.)
ISSN: 2691-4581
DOI: 10.1109/TAI.2023.3341879
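To make the "rearrange, mix, restore" idea in the abstract concrete, the sketch below shows a minimal Mixer-style block in PyTorch, assuming a fused grid of text and image tokens: tokens are rolled along the sequential axis (rearrangement), mixed with a small MLP Unit, rolled back (restore), and then mixed along the channel axis. This is a hedged illustration only; the class and parameter names (RRMixerBlock, MlpUnit, shift) are assumptions, not the authors' implementation, and the paper's grouping rearrangement and modal-axis mixing are omitted here.

```python
# Illustrative sketch (not the authors' code): an MLP-Mixer-style block that
# rolls the fused token sequence, mixes it with an MLP, and restores the order.
import torch
import torch.nn as nn


class MlpUnit(nn.Module):
    """Two-layer MLP applied along the last dimension."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)


class RRMixerBlock(nn.Module):
    """Mix a fused feature tensor (batch, tokens, channels) along the
    sequential axis (with a roll/restore step) and along the channel axis."""

    def __init__(self, num_tokens: int, channels: int, shift: int = 1):
        super().__init__()
        self.shift = shift  # assumed roll offset; the paper's exact scheme differs
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.token_mlp = MlpUnit(num_tokens, num_tokens * 2)   # mixes across tokens
        self.channel_mlp = MlpUnit(channels, channels * 2)     # mixes across channels

    def forward(self, x):  # x: (B, N, C)
        # Rearrangement: roll tokens so each position mixes a shifted neighbourhood.
        rolled = torch.roll(self.norm1(x), shifts=self.shift, dims=1)
        mixed = self.token_mlp(rolled.transpose(1, 2)).transpose(1, 2)
        # Restore: undo the roll, then add the residual connection.
        x = x + torch.roll(mixed, shifts=-self.shift, dims=1)
        # Channel-axis mixing (standard Mixer-style step).
        return x + self.channel_mlp(self.norm2(x))


if __name__ == "__main__":
    # Toy usage: 16 RoBERTa text tokens + 16 selected ViT image patches fused
    # into one 32-token grid with 768-dimensional features (assumed sizes).
    fused = torch.randn(2, 32, 768)
    block = RRMixerBlock(num_tokens=32, channels=768)
    print(block(fused).shape)  # torch.Size([2, 32, 768])
```

The residual connections and per-axis LayerNorms follow common MLP-Mixer practice; whether RR-Mixer uses exactly this arrangement is not stated in the abstract.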