Speech emotion recognition with transfer learning and multi-condition training for noisy environments

Bibliographic details
Published in: International journal of speech technology 2024, Vol.27 (2), p.353-365
Main authors: Haque, Arijul; Rao, Krothapalli Sreenivasa
Format: Article
Language: English
Description
Summary: This paper explores the use of transfer learning techniques to develop robust speech emotion recognition (SER) models capable of handling noise in real-world environments. Two SER frameworks are proposed: Framework-1 is a two-stage framework that retrains pretrained networks on clean data in the first stage and then fine-tunes them on noisy data in the second stage, while Framework-2 retrains pretrained networks directly on multi-conditioned noisy data. The multi-conditioned data are created from both natural noise recordings and trance music under a single augmentation framework. Three pretrained models (AlexNet, GoogleNet, VGG19) are evaluated on two datasets (IEMOCAP and IITKGP-SEHSC) using bottleneck features and, for noise mitigation, quantized bottleneck features (applied only in the test phase). The experiments involve either retraining the last one or two layers or training an SVM classifier on the bottleneck features. The results show that GoogleNet and VGG19 outperform AlexNet, and that fine-tuning the final two layers of these models achieves the highest accuracy. Quantized bottleneck features improve performance further. Most importantly, Framework-2 outperforms Framework-1 in most cases. Although comparisons with existing work are difficult because experimental settings vary widely across related studies, the findings demonstrate competitive performance. A major novelty of this work lies in the variety of SNR conditions explored and the use of trance music for creating multi-conditioned noisy data.
ISSN: 1381-2416, 1572-8110
DOI: 10.1007/s10772-024-10109-5
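
The summary describes building multi-conditioned training data by mixing speech with noise (natural recordings and trance music) at several SNR levels, but the record does not say how the mixing is done. The following is a minimal sketch of one common way to do it, not the authors' implementation. It assumes signals are plain lists of float samples at the same sampling rate; the function names `power`, `mix_at_snr`, and `multi_condition_set`, and any SNR values passed in, are hypothetical and chosen only for illustration.

```python
import math
import random


def power(signal):
    """Mean power (average of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        # Loop the noise so that it covers the whole utterance.
        repeats = math.ceil(len(speech) / len(noise))
        noise = list(noise) * repeats
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def multi_condition_set(utterances, noises, snrs_db, seed=0):
    """Mix every utterance with a randomly chosen noise at every SNR level.

    Assumption: one random noise per (utterance, SNR) pair; the paper's
    actual pairing scheme is not given in this record.
    """
    rng = random.Random(seed)
    mixed = []
    for speech in utterances:
        for snr_db in snrs_db:
            noise = rng.choice(noises)
            mixed.append((snr_db, mix_at_snr(speech, noise, snr_db)))
    return mixed
```

Under these assumptions, calling `multi_condition_set(utterances, [street_noise, trance_clip], [0, 10, 20])` would return one noisy copy of each utterance per SNR level, which could then be converted to whatever input representation the pretrained network expects. The list-based arithmetic keeps the sketch dependency-free; an array library would be the natural choice for real audio.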