Deep neural network architectures for audio emotion recognition performed on song and speech modalities

Bibliographic Details
Published in: International Journal of Speech Technology, December 2023, Vol. 26(4), pp. 1165-1181
Main authors: Ayadi, Souha; Lachiri, Zied
Format: Article
Language: English
Online access: Full text
Description
Abstract: Audio emotion recognition has been a very active topic over the last decade. The emotions conveyed by singing and those conveyed by speech are treated separately because of their different signal characteristics. To address this, convolutional neural networks (CNN) and recurrent neural networks (RNN) appear to be the most successful and relevant kinds of neural networks in this field. However, the model architecture varies considerably between image classification and audio classification, depending on the task being processed. Therefore, since we are working on audio data, we build three different models by processing audio data with the same techniques applied to text and image data. The goal is to create a technique that produces several models adapted to the nature of the database; additionally, audio data are processed as images to improve both the process and the accuracy of the results. In this paper, we present three different models, Conv1D, Conv2D and LSTM, leading to a fourth model that recombines the structures of the CNN and the LSTM. The models are evaluated on the RAVDESS dataset. The main purpose is to take advantage of the best criteria of each neural network model by creating a technique that avoids errors as much as possible, in order to build models suitable for both Audio Song and Audio Speech. The process is as follows. First, mel-frequency cepstral coefficients (MFCC) are used for feature extraction. Second, an architecture is created for each neural network model and passed through a softmax layer for classification. Third, the first architecture is concatenated with another to improve the results. The proposed technique mitigates overfitting, provides stable performance and improves accuracy. After evaluating the performance of each model, a fourth model is presented that concatenates the CNN and the LSTM and follows the same process. The results show that the proposed models are comparable to state-of-the-art methods.
ISSN: 1381-2416, 1572-8110
DOI: 10.1007/s10772-023-10079-0
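
As a rough illustration of the pipeline described in the abstract (MFCC feature extraction, a convolutional front end, an LSTM layer, and a softmax classifier), the sketch below assumes librosa for MFCC computation and TensorFlow/Keras for the model. The helper names (extract_mfcc, build_cnn_lstm), the layer sizes, the frame count and the other hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Illustrative pipeline sketch (not the authors' exact configuration):
# MFCC feature extraction followed by a combined CNN + LSTM classifier
# with a softmax output, as outlined in the abstract.
import numpy as np
import librosa
from tensorflow.keras import layers, models

NUM_CLASSES = 8      # RAVDESS speech uses 8 emotion labels (the song subset has fewer)
N_MFCC = 40          # assumed number of MFCC coefficients per frame
MAX_FRAMES = 174     # assumed fixed number of time frames after padding/truncation


def extract_mfcc(path, sr=22050):
    """Load one audio clip and return a (MAX_FRAMES, N_MFCC) MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)  # shape (N_MFCC, frames)
    # Pad or truncate along the time axis so every clip has the same length.
    if mfcc.shape[1] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, 0), (0, MAX_FRAMES - mfcc.shape[1])))
    else:
        mfcc = mfcc[:, :MAX_FRAMES]
    return mfcc.T  # transpose to (time, features) to match the model input


def build_cnn_lstm():
    """Conv1D blocks followed by an LSTM layer and a softmax classifier."""
    inputs = layers.Input(shape=(MAX_FRAMES, N_MFCC))
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.LSTM(128)(x)      # recurrent layer on top of convolutional features
    x = layers.Dropout(0.3)(x)   # regularization against overfitting
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


if __name__ == "__main__":
    build_cnn_lstm().summary()
```

In a training run under these assumptions, the MFCC matrices extracted from the RAVDESS clips would be stacked into an array of shape (num_clips, MAX_FRAMES, N_MFCC) and passed to model.fit together with integer emotion labels.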