A two-channel speech emotion recognition model based on raw stacked waveform

Bibliographic Details
Published in: Multimedia Tools and Applications, 2022-03, Vol. 81 (8), p. 11537-11562
Main authors: Zheng, Chunjun; Wang, Chunli; Jia, Ning
Format: Article
Language: English
Online access: Full text
Description
Abstract: To improve the accuracy and efficiency of speech emotion recognition (SER), an acoustic feature set and a speech emotion recognition model were designed based on the original speech signal, and the nonlinear relationship between the acoustic features, the recognition model, and the recognition task was explored. Moreover, the original features of the speech signal were studied rather than the traditional statistical features. A joint two-channel model based on the raw stacked waveform was proposed. To model raw waveform features, a convolutional recurrent neural network (CRNN) and a bi-directional long short-term memory (BiLSTM) network were introduced. An attention mechanism was integrated into the model so that a single channel could learn both the expression of salient local regions and global emotion features. Through these channels, the multi-scale perception of speech acoustic features is improved, and the internal correlation between the salient regions and the convolutional neural network is explored. The time-domain and frequency-domain features of speech are highlighted, and the local expression of emotion is emphasized. Based on a preprocessing strategy of background separation and dimension unification, the convolutional recurrent neural network is used to extract global information. The proposed joint model could effectively integrate the advantages of the two channels. Several comparative experiments were conducted on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. The experimental results showed that, compared with popular models, the proposed two-channel SER model improved recognition accuracy (UA) by 5.1% and shortened the convergence period by 58%. Furthermore, it performed best at handling data skew and improving efficiency, which demonstrated the importance of features and models based on the raw waveform.
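The abstract describes a two-channel architecture over the raw waveform: a CRNN channel for global information and a BiLSTM channel with attention for salient local regions, fused for emotion classification. The PyTorch sketch below is only a minimal illustration of that general idea; it is not the authors' implementation, and every layer size, frame length, class count, and name in it is an assumption.

```python
# Illustrative sketch of a generic two-channel raw-waveform SER model
# (CRNN channel + BiLSTM-with-attention channel, fused for classification).
# All hyperparameters and shapes are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Simple additive attention over time steps (assumed, not from the paper)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                      # x: (batch, time, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)        # (batch, dim)

class TwoChannelSER(nn.Module):
    def __init__(self, num_classes=4):        # 4 classes is a common IEMOCAP setup
        super().__init__()
        # Channel 1: CRNN -- 1-D convolutions over the raw waveform, then BiLSTM.
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.crnn_lstm = nn.LSTM(64, 64, batch_first=True, bidirectional=True)
        self.crnn_attn = AttentionPool(128)
        # Channel 2: BiLSTM with attention over framed (stacked) waveform segments.
        self.frame_lstm = nn.LSTM(400, 64, batch_first=True, bidirectional=True)
        self.frame_attn = AttentionPool(128)
        # Joint classifier over the concatenated channel embeddings.
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, wave, frames):
        # wave:   (batch, samples)         raw waveform for the CRNN channel
        # frames: (batch, n_frames, 400)   framed waveform for the BiLSTM channel
        c = self.conv(wave.unsqueeze(1)).transpose(1, 2)   # (batch, time, 64)
        c, _ = self.crnn_lstm(c)
        c = self.crnn_attn(c)                               # global embedding
        f, _ = self.frame_lstm(frames)
        f = self.frame_attn(f)                              # salient-region embedding
        return self.classifier(torch.cat([c, f], dim=-1))

# Usage with random tensors (shapes are assumptions).
model = TwoChannelSER()
logits = model(torch.randn(2, 16000), torch.randn(2, 50, 400))
print(logits.shape)  # torch.Size([2, 4])
```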
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-022-12378-1