Improving multilingual speech emotion recognition by combining acoustic features in a three-layer model


Bibliographic Details
Published in: Speech Communication 2019-07, Vol. 110, p. 1-12
Authors: Li, Xingfeng; Akagi, Masato
Format: Article
Language: English
Online access: Full text
Description
Abstract:
• We study multilingual speech emotion recognition (mSER) using combined acoustic features in a three-layer perceptual emotion model.
• We analyze three key issues: (1) features robust for mSER; (2) the impact of speaker normalization (SN); (3) generalization of mSER to a new language.
• Prosody and modulation spectrum features are studied; z-normalization is used for SN; cross-speaker and cross-corpus tasks enhance the robustness of mSER.
• The proposed mSER model outperforms previous work. Notably, it achieves results comparable to monolingual SER on a new language without training on that language.

This study presents a scheme for multilingual speech emotion recognition. Determining the emotion of speech generally relies on specific training data, so a different target speaker or language can present significant challenges. In this regard, we first explore 215 acoustic features of emotional speech. Second, we carry out speaker normalization and feature selection to develop a shared standard acoustic parameter set for multiple languages. Third, we use a three-layer model composed of acoustic features, semantic primitives, and emotion dimensions to map acoustics onto emotion dimensions. Finally, we classify the continuous emotion-dimensional values into basic categories using logistic model trees. The proposed approach was tested on Japanese, German, Chinese, and English emotional speech corpora. Recognition performance was examined and enhanced through cross-speaker and cross-corpus evaluation, showing that our strategy is particularly well suited to multilingual emotion recognition even with a different speaker or language. The experimental results were reasonably comparable with those of monolingual emotion recognizers used as a reference.
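The abstract outlines a pipeline: per-speaker z-normalization of acoustic features, a three-layer mapping from acoustic features through semantic primitives to emotion dimensions, and classification of the resulting dimensional values with logistic model trees. The sketch below only illustrates that flow on random toy data; the ridge regressors, the plain decision tree standing in for logistic model trees, the primitive count, and the function names are assumptions for illustration, not the estimators or settings used in the paper.

```python
# Minimal sketch of the pipeline described in the abstract, on toy data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeClassifier


def speaker_z_normalize(features, speaker_ids):
    """Z-normalize each acoustic feature per speaker (the SN step in the abstract)."""
    normed = np.empty_like(features, dtype=float)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = features[idx].mean(axis=0)
        sigma = features[idx].std(axis=0) + 1e-8  # guard against zero variance
        normed[idx] = (features[idx] - mu) / sigma
    return normed


# Toy data standing in for a labeled emotional speech corpus:
# 200 utterances x 215 acoustic features (the feature count from the abstract),
# with made-up annotations for semantic primitives and emotion dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 215))
speakers = rng.integers(0, 4, size=200)
primitives = rng.normal(size=(200, 10))    # adjective-like primitives; count is illustrative
dimensions = rng.normal(size=(200, 2))     # e.g. valence and arousal
categories = rng.integers(0, 4, size=200)  # basic emotion categories

X = speaker_z_normalize(X, speakers)

# Layer 1 -> 2: acoustic features to semantic primitives.
to_primitives = Ridge().fit(X, primitives)
# Layer 2 -> 3: semantic primitives to emotion dimensions.
to_dimensions = Ridge().fit(to_primitives.predict(X), dimensions)
# Final step: dimensional estimates to basic categories. A plain decision tree
# is used here as a stand-in for the logistic model trees named in the abstract.
est_dims = to_dimensions.predict(to_primitives.predict(X))
to_categories = DecisionTreeClassifier(max_depth=4).fit(est_dims, categories)

print(to_categories.predict(est_dims[:5]))
```

Chaining two regressors rather than mapping acoustics to dimensions directly mirrors the three-layer structure the abstract describes, with semantic primitives as the intermediate layer.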
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2019.04.004