A Unimodal Reinforced Transformer With Time Squeeze Fusion for Multimodal Sentiment Analysis

Bibliographic Details
Published in:	IEEE Signal Processing Letters, 2021, Vol. 28, pp. 992-996
Main authors:	He, Jiaxuan; Mai, Sijie; Hu, Haifeng
Format:	Article
Language:	English
Subjects:	Analytical models; Convolution; Data mining; Embedding; Fuses; Kernel; multimodal sentiment analysis; Sentiment analysis; Sequential analysis; Sparse matrices; Time dependence; Time squeeze fusion; Transformers; unimodal reinforced transformer; Visualization

Description
Summary:	Multimodal sentiment analysis refers to inferring sentiment from language, acoustic, and visual sequences. Previous studies focus on analyzing aligned sequences, although analysis of unaligned sequences is more practical in real-world scenarios. Because unaligned multimodal sequences hide long-term dependencies and come without time alignment information, exploring the time-dependent interactions within them is more challenging. To this end, we introduce time squeeze fusion, which automatically explores the time-dependent interactions by modeling the unimodal and multimodal sequences from the perspective of compressing the time dimension. Moreover, prior methods tend to fuse unimodal features into a multimodal embedding, from which sentiment is inferred. However, we argue that unimodal information may be lost in this process, or the generated multimodal embedding may be redundant. To address this issue, we propose a unimodal reinforced Transformer that progressively attends to and distills unimodal information from the multimodal embedding, enabling the multimodal embedding to highlight the discriminative unimodal information. Extensive experiments suggest that our model reaches state-of-the-art performance in terms of accuracy and F1 score on the MOSEI dataset.
ISSN:	1070-9908 1558-2361
DOI:	10.1109/LSP.2021.3078074