AAD-Net: Advanced end-to-end signal processing system for human emotion detection & recognition using attention-based deep echo state network
Speech signals are the most convenient way of communication between human beings and the eventual method of Human–Computer Interaction (HCI) to exchange emotions and information. Recognizing emotions from speech signals is a challenging task due to the sparse nature of emotional data and features. I...
Gespeichert in:
Veröffentlicht in: | Knowledge-based systems 2023-06, Vol.270, p.110525, Article 110525 |
---|---|
Hauptverfasser: | , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Speech signals are the most convenient way of communication between human beings and the eventual method of Human–Computer Interaction (HCI) to exchange emotions and information. Recognizing emotions from speech signals is a challenging task due to the sparse nature of emotional data and features. In this article, we proposed a Deep Echo-State-Network (DeepESN) system for emotion recognition with a dilated convolution neural network and multi-headed attention mechanism. To reduce the model complexity, we incorporate a DeepESN that combines reservoir computing for higher-dimensional mapping. We also used fine-tuned Sparse Random Projection (SRP) to reduce dimensionality and adopted an early fusion strategy to fuse the extracted cues and passed the joint feature vector via a classification layer to recognize emotions. Our proposed model is evaluated on two public speech corpora, EMO-DB and RAVDESS, and tested for subject/speaker-dependent/independent performance. The results show that our proposed system achieves a high recognition rate, 91.14, 85.57 for EMO-DB, and 82.01, 77.02 for RAVDESS, using speaker-dependent and independent experiments, respectively. Our proposed system outperforms the State-of-The-Art (SOTA) while requiring less computational time.
•The model introduced output feedback into the state update equation for fast learning.•Proposed an algorithm to determined the number of ESN-layers to existence methods.•Indorsed the compatibility & reliability of DeepESN for human emotion recognition.•Used attention block to deal with salient information of speech signal amplitude for SER.•Results indicated the significance of the proposed model over baseline methods. |
---|---|
ISSN: | 0950-7051 1872-7409 |
DOI: | 10.1016/j.knosys.2023.110525 |