Transfer Accent Identification Learning for Enhancing Speech Emotion Recognition
Published in: | Circuits, Systems, and Signal Processing, 2024-08, Vol. 43 (8), p. 5090-5120 |
---|---|
Main authors: | , |
Format: | Article |
Language: | English |
Abstract: | Emotional speech depends on language, and even within a single language it varies with accent. The presence of accents degrades the performance of speech emotion recognition (SER) systems. A pre-trained accent identification (AID) system can effectively capture the characteristics of accent variation in emotional speech, which is an important factor in developing a more reliable SER system. In this work, we investigate the dependencies between accent identification and emotion recognition to enhance SER performance. This paper proposes a novel transfer learning-based approach that uses accent identification knowledge for SER. In the proposed method, a deep neural network (DNN) models the accent identification system, using statistical aggregation functions (mean, standard deviation, median, etc.) of spectral subband centroid (SSC) features and Mel-frequency discrete wavelet coefficients (MFDWC). To build the SER system, a deep convolutional recurrent autoencoder produces an attention-based latent representation, and acoustic features are extracted with the openSMILE toolkit. A separate DNN model learns the mapping between the attention features and the acoustic features for SER. In addition, a priori knowledge of accent, transferred to the SER system through transfer learning (TL), can improve its performance. The proposed method is evaluated on the accented emotional speech utterances of the CREMA-D dataset and compared with state-of-the-art techniques. The experimental results show that transferring AID learning improves the recognition rate of the SER system, yielding around 8% relative improvement in accuracy over existing SER techniques. |
---|---|
ISSN: | 0278-081X 1531-5878 |
DOI: | 10.1007/s00034-024-02687-1 |
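
The abstract mentions statistical aggregation functions (mean, standard deviation, median, etc.) applied to spectral subband centroid (SSC) features. The following is a minimal illustrative sketch of that idea, not the authors' implementation: the function names `subband_centroids` and `aggregate`, the band edges, the midpoint fallback for empty bands, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, median, pstdev


def subband_centroids(power, freqs, edges):
    """Return the centroid frequency of each subband for one frame.

    power: power-spectrum values for the frame
    freqs: frequency (Hz) of each spectrum bin
    edges: subband boundaries; band i covers edges[i] <= f < edges[i + 1]
    (band layout is an assumption for illustration)
    """
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = [(f, p) for f, p in zip(freqs, power) if lo <= f < hi]
        total = sum(p for _, p in band)
        if total > 0:
            # Power-weighted mean frequency within the band.
            centroids.append(sum(f * p for f, p in band) / total)
        else:
            # Assumed fallback: band midpoint when the band has no energy.
            centroids.append((lo + hi) / 2)
    return centroids


def aggregate(frames):
    """Collapse frame-level vectors into one utterance-level vector.

    For each feature dimension, emits mean, population standard
    deviation, and median over all frames, in that order.
    """
    stats = []
    for values in zip(*frames):  # one tuple per feature dimension
        stats.extend([mean(values), pstdev(values), median(values)])
    return stats


# Toy example: three frames, four spectrum bins, two subbands.
freqs = [0, 100, 200, 300]
edges = [0, 200, 400]
spectra = [[1, 1, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
frames = [subband_centroids(p, freqs, edges) for p in spectra]
utterance_vector = aggregate(frames)
```

Here `utterance_vector` holds three statistics per subband, giving a fixed-length input regardless of utterance duration, which is the property that makes such aggregates convenient as DNN inputs.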