Efficient Feature-Aware Hybrid Model of Deep Learning Architectures for Speech Emotion Recognition


Full Description

Bibliographic Details
Published in: IEEE Access, 2021, Vol. 9, pp. 19999-20011
Main authors: Ezz-Eldin, Mai; Khalaf, Ashraf A. M.; Hamed, Hesham F. A.; Hussein, Aziza I.
Format: Article
Language: English
Online access: Full text
Description
Abstract: This paper proposes robust speech emotion recognition architectures based on hybrid convolutional neural networks (CNNs) and feedforward deep neural networks, named BFN, CNA, and HBN. BFN combines bag-of-audio-words (BoAW) features with a feedforward deep neural network, CNA is based on a CNN, and HBN is a hybrid of BFN and CNA. High overall accuracy is achieved by feeding the networks Mel-frequency cepstral coefficient (MFCC) features and bag-of-acoustic-words representations, yielding promising classification performance. In addition, the concatenated outputs of the proposed hybrid networks are fed into a softmax layer to produce a probability distribution over the categorical emotion classes. The three proposed models are trained on the eight emotional classes of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. The proposed models achieve overall precision between 81.5% and 85.5% and overall accuracy between 80.6% and 84.5%, outperforming state-of-the-art models on the same dataset.
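The HBN fusion step described in the abstract (concatenating the two branches' outputs and passing the result through a softmax layer over the eight RAVDESS emotion classes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding sizes, weight initialization, and random stand-in branch outputs are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# The eight RAVDESS emotion classes.
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical embedding sizes for the two branches.
bfn_dim, cna_dim, n_classes = 64, 128, len(EMOTIONS)

# Stand-ins for the branch outputs; in the paper these would come
# from the BoAW+feedforward branch (BFN) and the CNN branch (CNA).
bfn_out = rng.standard_normal(bfn_dim)
cna_out = rng.standard_normal(cna_dim)

# HBN-style fusion: concatenate, apply a dense projection, softmax.
W = rng.standard_normal((n_classes, bfn_dim + cna_dim)) * 0.01
b = np.zeros(n_classes)
fused = np.concatenate([bfn_out, cna_out])
probs = softmax(W @ fused + b)

print(probs.shape)             # (8,)
print(round(probs.sum(), 6))   # 1.0 -- a valid probability distribution
```

The concatenation lets the classifier weigh evidence from both feature views (BoAW statistics and MFCC-based CNN features) jointly rather than averaging two separate predictions.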
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3054345