A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face

Multimodal emotion recognition (MER) refers to the identification and understanding of human emotional states by combining different signals, including—but not limited to—text, speech, and face cues. MER plays a crucial role in the human–computer interaction (HCI) domain. With the recent progression...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	Entropy (Basel, Switzerland) Switzerland), 2023-10, Vol.25 (10), p.1440
Hauptverfasser:	Lian, Hailun, Lu, Cheng, Li, Sunan, Zhao, Yan, Tang, Chuangao, Zong, Yuan
Format:	Artikel
Sprache:	eng
Schlagworte:	Algorithms Data collection Datasets Deep learning Emotion recognition Emotion regulation Emotional factors Emotions Entropy Feature extraction fusion method Human-computer interface Machine learning Motion capture multimodal emotion recognition Research methodology Researchers Review Speech recognition survey Surveys
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Multimodal emotion recognition (MER) refers to the identification and understanding of human emotional states by combining different signals, including—but not limited to—text, speech, and face cues. MER plays a crucial role in the human–computer interaction (HCI) domain. With the recent progression of deep learning technologies and the increasing availability of multimodal datasets, the MER domain has witnessed considerable development, resulting in numerous significant research breakthroughs. However, a conspicuous absence of thorough and focused reviews on these deep learning-based MER achievements is observed. This survey aims to bridge this gap by providing a comprehensive overview of the recent advancements in MER based on deep learning. For an orderly exposition, this paper first outlines a meticulous analysis of the current multimodal datasets, emphasizing their advantages and constraints. Subsequently, we thoroughly scrutinize diverse methods for multimodal emotional feature extraction, highlighting the merits and demerits of each method. Moreover, we perform an exhaustive analysis of various MER algorithms, with particular focus on the model-agnostic fusion methods (including early fusion, late fusion, and hybrid fusion) and fusion based on intermediate layers of deep models (encompassing simple concatenation fusion, utterance-level interaction fusion, and fine-grained interaction fusion). We assess the strengths and weaknesses of these fusion strategies, providing guidance to researchers to help them select the most suitable techniques for their studies. In summary, this survey aims to provide a thorough and insightful review of the field of deep learning-based MER. It is intended as a valuable guide to aid researchers in furthering the evolution of this dynamic and impactful field.
ISSN:	1099-4300 1099-4300
DOI:	10.3390/e25101440