Efficient Feature-Aware Hybrid Model of Deep Learning Architectures for Speech Emotion Recognition


Full Description

Bibliographic Details
Published in: IEEE Access, 2021, Vol. 9, pp. 19999-20011
Main authors: Ezz-Eldin, Mai; Khalaf, Ashraf A. M.; Hamed, Hesham F. A.; Hussein, Aziza I.
Format: Article
Language: English
Online access: Full text
Description
Abstract: This paper proposes robust speech emotion recognition architectures based on hybrid convolutional neural networks (CNNs) and feedforward deep neural networks, named BFN, CNA, and HBN. BFN combines bag-of-audio-words (BoAW) features with a feedforward deep neural network, CNA is based on a CNN, and HBN is a hybrid of BFN and CNA. High overall accuracy is achieved by feeding the networks Mel-frequency cepstral coefficient (MFCC) features and bag-of-acoustic-words representations, yielding promising classification performance. In addition, the concatenated outputs of the proposed hybrid networks are fed into a softmax layer to produce a probability distribution over the categorical emotion classes. The three proposed models are trained on the eight emotional classes of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. The proposed models achieve overall precision between 81.5% and 85.5% and overall accuracy between 80.6% and 84.5%, outperforming state-of-the-art models on the same dataset.
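The HBN fusion step described in the abstract (concatenating the two branches' outputs and passing the result through a softmax layer over the eight RAVDESS emotion classes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding sizes, weight initialization, and random stand-in branch outputs are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# The eight RAVDESS emotion classes.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical embedding sizes for the two branches.
bfn_dim, cna_dim, n_classes = 64, 128, len(EMOTIONS)

# Stand-ins for the branch outputs; in the paper these would come
# from the BoAW+feedforward branch (BFN) and the CNN branch (CNA).
bfn_out = rng.standard_normal(bfn_dim)
cna_out = rng.standard_normal(cna_dim)

# HBN-style fusion: concatenate, apply a dense projection, softmax.
W = rng.standard_normal((n_classes, bfn_dim + cna_dim)) * 0.01
b = np.zeros(n_classes)
fused = np.concatenate([bfn_out, cna_out])
probs = softmax(W @ fused + b)

print(probs.shape)             # (8,)
print(round(probs.sum(), 6))   # 1.0 -- a valid probability distribution
```

The concatenation lets the classifier weigh evidence from both feature views (BoAW statistics and MFCC-based CNN features) jointly rather than averaging two separate predictions.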
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3054345