Coarse-to-Fine Speech Emotion Recognition Based on Multi-Task Learning

Bibliographic details
Published in: Journal of signal processing systems 2021-03, Vol.93 (2-3), p.299-308
Main authors: Huijuan, Zhao; Ning, Ye; Ruchuan, Wang
Format: Article
Language: English
Online access: Full text
Description
Abstract: Speech emotion recognition is very challenging because the definition of emotion is uncertain and the feature representation is complex. Accurate feature representation is one of the key factors for successful speech emotion recognition. Studies have shown that 3D data composed of the static, delta, and delta-delta coefficients of the log-Mel spectrum is very effective at filtering out irrelevant features. The challenge of speech emotion recognition is also reflected in the need for fine-grained classification: typical applications of affective computing, such as psychological counseling and emotion regulation, require fine-grained emotion recognition. Motivated by these two observations, this paper proposes an end-to-end hierarchical multi-task learning framework that proceeds from coarse to fine to achieve fine-grained emotion recognition. With 3D data as input, the first stage is trained on the coarse emotion type, and its result is then used to assist the second-stage training on the fine emotion type. Comparative experiments on the IEMOCAP corpus show that the coarse-to-fine classification approach yields a significant performance improvement over the baseline models.
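The 3D input mentioned in the abstract (static, delta, and delta-delta log-Mel features) can be illustrated with a minimal sketch. The abstract does not say how the deltas are computed, so the example below assumes the commonly used regression formula with a window of two frames and edge padding; the function names `delta` and `stack_3d` and the `width` parameter are hypothetical and not taken from the paper.

```python
def delta(frames, width=2):
    """Regression-based delta along the time axis (assumed formula, not from the paper).

    frames: list of time frames, each a list of log-Mel band values.
    Frames beyond either end are replaced by the nearest edge frame.
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_3d(log_mel, width=2):
    """Stack static, delta, and delta-delta features as three channels.

    Returns a nested list shaped (3, time, mel_bands).
    """
    d1 = delta(log_mel, width)   # first-order dynamics
    d2 = delta(d1, width)        # second-order dynamics
    return [log_mel, d1, d2]


# Toy log-Mel "spectrogram": 6 frames, 2 bands, rising linearly over time.
log_mel = [[float(t), float(t)] for t in range(6)]
features = stack_3d(log_mel)
```

In this toy input the interior frames have a delta of 1.0, matching the slope of the ramp, while the edge frames are smaller because of the padding. A real pipeline would compute `log_mel` from audio and feed the three channels to the network as an image-like tensor.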
ISSN: 1939-8018, 1939-8115
DOI: 10.1007/s11265-020-01538-x