Effects of Data Augmentations on Speech Emotion Recognition

Bibliographic details
Published in:	Sensors (Basel, Switzerland), 2022-08, Vol. 22 (16), p. 5941
Main authors:	Atmaja, Bagus Tris; Sasou, Akira
Format:	Article
Language:	English
Keywords:	Accuracy; affective computing; Data augmentation; data augmentations; Datasets; Emotion recognition; Emotions; Humans; Linguistics; Neural networks; Perception; Performance evaluation; Speech; speech emotion recognition; Speech processing; Speech recognition; SVM; Text Messaging; Tradeoffs; Voice recognition; wav2vec 2.0

Description
Abstract:	Data augmentation techniques have recently gained wider adoption in speech processing, including speech emotion recognition. Although more data tend to be more effective, there may be a trade-off beyond which more data do not yield a better model. This paper reports experiments investigating the effects of data augmentation on speech emotion recognition. The investigation aims to find the most useful type of data augmentation, and the most useful number of data augmentations, for speech emotion recognition under various conditions. The experiments are conducted on the Japanese Twitter-based emotional speech and IEMOCAP datasets. The results show that for speaker-independent data, two data augmentations, glottal source extraction and silence removal, exhibited the best performance, outperforming configurations that used more data augmentation techniques. For text-independent data (including speaker- and text-independent data), more data augmentations tend to improve speech emotion recognition performance. The results highlight the trade-off between the number of data augmentations and speech emotion recognition performance, showing the need to choose a proper data augmentation technique for each specific condition.
ISSN:	1424-8220
DOI:	10.3390/s22165941
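The abstract above names silence removal as one of the two most effective augmentations for speaker-independent data, but the record does not say how it is implemented or how augmented copies are combined with the original training set. The following is a minimal sketch under stated assumptions: silence is removed with a simple energy threshold, and each augmentation adds one extra copy of every labelled utterance. The functions `remove_silence` and `augment_dataset`, the 25 ms frame length, and the -40 dB threshold are hypothetical illustrations, not the authors' method. Glottal source extraction (usually done by inverse filtering) is not sketched.

```python
from functools import partial

import numpy as np


def remove_silence(signal: np.ndarray, sr: int, frame_ms: float = 25.0,
                   threshold_db: float = -40.0) -> np.ndarray:
    """Hypothetical energy-based silence removal (not the paper's method).

    Splits the signal into non-overlapping frames and keeps only frames whose
    RMS level is above `threshold_db` relative to the loudest frame. Trailing
    samples that do not fill a whole frame are discarded.
    """
    frame_len = max(1, int(sr * frame_ms / 1000))
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return signal.copy()  # too short to frame; leave unchanged
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = rms.max()
    if ref == 0:
        return signal.copy()  # fully silent input; nothing to compare against
    # Level of each frame in dB relative to the loudest frame (epsilon avoids log(0)).
    level_db = 20 * np.log10((rms + 1e-12) / ref)
    return frames[level_db > threshold_db].reshape(-1)


def augment_dataset(samples, augmentations):
    """Hypothetical helper: keep every original (signal, label) pair and add one
    augmented copy per augmentation, so k augmentations give (k + 1) * N samples."""
    out = list(samples)
    for aug in augmentations:
        out.extend((aug(x), y) for x, y in samples)
    return out


# Usage: 0.5 s silence + 1 s tone standing in for speech + 0.5 s silence.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
speech = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
trimmed = remove_silence(speech, sr)
print(len(speech), len(trimmed))  # 32000 16000

train = [(speech, "angry"), (tone, "neutral")]
augmented = augment_dataset(train, [partial(remove_silence, sr=sr)])
print(len(augmented))  # 4
```

The framing, the threshold, and the one-copy-per-augmentation scheme are illustrative choices; because emotion labels are carried over unchanged, every added augmentation grows the training set linearly in this sketch, which is one way the trade-off described in the abstract could arise.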