Can We Exploit All Datasets? Multimodal Emotion Recognition Using Cross-Modal Translation
Published in: IEEE Access, 2022, Vol. 10, pp. 64516-64524
Author:
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract: The use of sufficiently large datasets is important for most deep learning tasks, and emotion recognition is no exception. Multimodal emotion recognition considers multiple modalities simultaneously to improve accuracy and robustness, typically using three modalities: visual, audio, and text. As in other deep learning tasks, large datasets are required. Various heterogeneous datasets exist, including unimodal datasets constructed for traditional unimodal recognition and bimodal or trimodal datasets for multimodal emotion recognition. A trimodal emotion recognition model achieves high performance and robustness by comprehensively considering multiple modalities; however, such a model cannot directly exploit unimodal or bimodal datasets. In this study, we propose a novel method to improve emotion recognition performance based on a cross-modal translator that can translate among the three modalities. The proposed method can train a multimodal model based on three modalities with heterogeneous datasets of different types, and the datasets do not require alignment between the visual, audio, and text modalities. By adding unimodal and bimodal datasets to the trimodal dataset, we achieved performance exceeding the baseline on CMU-MOSEI and IEMOCAP, which are representative multimodal datasets.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3183587
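
For illustration, the following is a minimal sketch, not the authors' published implementation, of the cross-modal translation idea described in the abstract: small translator networks map the embeddings of available modalities into the embedding space of a missing modality, so a trimodal classifier can also be trained on unimodal or bimodal samples. All module names, dimensions, and the fusion scheme below are illustrative assumptions.

```python
# Hedged sketch of cross-modal translation for heterogeneous emotion datasets.
# Everything here (layer sizes, averaging of translations, late fusion) is an
# assumption for illustration, not the paper's architecture.
import torch
import torch.nn as nn

EMB = 64           # shared embedding size (assumption)
NUM_CLASSES = 6    # number of emotion categories (assumption)
MODALITIES = ("visual", "audio", "text")

class CrossModalTranslator(nn.Module):
    """Maps one modality's embedding into another modality's embedding space."""
    def __init__(self, dim=EMB):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

class TrimodalClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # One encoder per modality (here raw features are assumed to already be EMB-dim).
        self.encoders = nn.ModuleDict({m: nn.Linear(EMB, EMB) for m in MODALITIES})
        # One translator per ordered modality pair, e.g. "audio->text".
        self.translators = nn.ModuleDict({
            f"{src}->{dst}": CrossModalTranslator()
            for src in MODALITIES for dst in MODALITIES if src != dst
        })
        # Late fusion of the three (observed or translated) embeddings.
        self.classifier = nn.Linear(EMB * 3, NUM_CLASSES)

    def forward(self, sample):
        # `sample` maps modality name -> feature tensor; missing modalities are absent.
        embeddings = {m: self.encoders[m](x) for m, x in sample.items()}
        for dst in MODALITIES:
            if dst not in embeddings:
                # Fill a missing modality by averaging translations from the ones present
                # (for a unimodal sample, later fills also reuse translated embeddings).
                translated = [self.translators[f"{src}->{dst}"](embeddings[src])
                              for src in embeddings]
                embeddings[dst] = torch.stack(translated).mean(dim=0)
        fused = torch.cat([embeddings[m] for m in MODALITIES], dim=-1)
        return self.classifier(fused)

# Toy usage: a bimodal batch (audio + text) from a dataset without visual data.
model = TrimodalClassifier()
sample = {"audio": torch.randn(8, EMB), "text": torch.randn(8, EMB)}
logits = model(sample)
print(logits.shape)  # torch.Size([8, NUM_CLASSES])
```

Because missing modalities are synthesized in embedding space, such a model can, in principle, be trained on unimodal, bimodal, and trimodal samples with a single classification head; the translation and fusion details here are only one plausible realization of the approach described in the abstract.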