StyleBERT: Text-audio sentiment analysis with Bi-directional Style Enhancement

Bibliographic details
Published in: Information Systems (Oxford) 2023-03, Vol. 114, p. 102147, Article 102147
Main authors: Lin, Fei; Liu, Shengqiang; Zhang, Cong; Fan, Jin; Wu, Zizhao
Format: Article
Language: English
Description
Abstract: Recent multimodal sentiment analysis works focus on establishing sophisticated fusion strategies for better performance. However, a major limitation of these works is that they ignore effective modality representation learning before fusion. In this work, we propose a novel text-audio sentiment analysis framework, named StyleBERT, to enhance the emotional information of unimodal representations by learning distinct modality styles, such that the model already obtains an effective unimodal representation before fusion, which mitigates the reliance on fusion. In particular, we propose a Bi-directional Style Enhancement module, which learns one contextualized style representation and two differentiated style representations for each modality, where the relevant semantic information across modalities and the discriminative characteristics of each modality will be captured. Furthermore, to learn fine-grained acoustic representation, we only use the directly available Log-Mel spectrograms as audio modality inputs and encode them with a multi-head self-attention mechanism. Comprehensive experimental results on three widely used benchmark datasets demonstrate that the proposed StyleBERT is an effective multimodal framework and significantly outperforms the state-of-the-art multimodal baselines. Our code is available at https://github.com/lsq960124/StyleBERT.
Highlights:
• Before multimodal fusion, we must perform modality representation learning.
• Learning distinct modality style representations to enhance the representation ability of modality itself.
• Comparable sentiment analysis results can be achieved using only directly available Log-Mel spectrograms.
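The abstract states that the audio branch takes raw Log-Mel spectrograms and encodes them with multi-head self-attention. The NumPy sketch below illustrates that general pipeline only. It is not the authors' implementation (see the linked repository for that). The helper names `log_mel_spectrogram` and `multi_head_self_attention` are hypothetical. The frame length (400 samples), hop (160 samples), 40 Mel bands, HTK-style Mel formula, 4 attention heads, random projection weights standing in for learned ones, and the synthetic 440 Hz test tone are all illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel scale (an assumption; other Mel formulas exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=40, eps=1e-10):
    """Hypothetical helper: (frames, n_mels) log-Mel spectrogram of a 1-D signal."""
    # Frame the signal with a Hann window (assumes len(signal) >= n_fft).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2  # (frames, n_fft // 2 + 1)

    # Triangular Mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):  # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Small eps keeps the log finite for empty bands.
    return np.log(power @ fbank.T + eps)

def multi_head_self_attention(x, num_heads, rng):
    """Hypothetical helper: one scaled dot-product self-attention layer over frames."""
    T, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    # Random projections stand in for learned weights (illustration only).
    w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                          for _ in range(4))

    def split(h):  # (T, d_model) -> (heads, T, d_head)
        return h.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key frames
    context = (weights @ v).transpose(1, 0, 2).reshape(T, d_model)  # merge heads
    return context @ w_o, weights

# Usage on one second of a synthetic tone (placeholder for real speech audio).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)
features = log_mel_spectrogram(audio, sr=sr)
encoded, attn = multi_head_self_attention(features, num_heads=4, rng=rng)
print(features.shape, encoded.shape, attn.shape)
```

Each Log-Mel frame is treated as one token, so attention lets every frame weigh every other frame across the utterance. A trained model would add learned weights, and likely residual connections and normalization, on top of this bare layer.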
ISSN: 0306-4379; 1873-6076
DOI: 10.1016/j.is.2022.102147