A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition
Main authors:
Format: Article
Language: English
Subjects:
Online access: order full text
Summary: Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks. One such crucial task is Speech Emotion Recognition (SER), which has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning. PTM embeddings have helped advance SER; however, a comprehensive comparison of these PTM embeddings that considers multiple facets, such as embedding model architecture, data used for pre-training, and the pre-training procedure followed, is still missing. A thorough comparison of PTM embeddings will aid in the faster and more efficient development of models and enable their deployment in real-world scenarios. In this work, we address this research gap and perform a comparative analysis of embeddings extracted from eight speech and audio PTMs (wav2vec 2.0, data2vec, wavLM, UniSpeech-SAT, wav2clip, YAMNet, x-vector, ECAPA). We perform an extensive empirical analysis on four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms (XGBoost, Random Forest, FCN) on the derived embeddings. The results of our study indicate that the best performance is achieved by algorithms trained on embeddings derived from PTMs trained for speaker recognition, followed by wav2clip and UniSpeech-SAT. The top performance of embeddings from speaker-recognition PTMs is most likely due to these models capturing information about numerous speech features, such as tone, accent, and pitch, during their speaker-recognition training. Insights from this work will assist future studies in their selection of embeddings for applications related to SER.
DOI: 10.48550/arxiv.2304.11472
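
The abstract describes a two-stage pipeline: extract utterance-level embeddings from a pre-trained speech or audio model, then train a downstream classifier on those embeddings. The sketch below is a minimal illustration of that pipeline, not the authors' code: it assumes wav2vec 2.0 via the Hugging Face `transformers` library, mean-pooling over frame-level outputs, and XGBoost as the classifier; the checkpoint name, pooling choice, and data handling are all assumptions made for illustration.

```python
# Illustrative sketch (not the paper's implementation): embed audio with a
# pre-trained model, then fit a classifier on the pooled embeddings.
import torch
import torchaudio
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
from xgboost import XGBClassifier

# Hypothetical checkpoint; the paper compares eight PTMs (wav2vec 2.0,
# data2vec, wavLM, UniSpeech-SAT, wav2clip, YAMNet, x-vector, ECAPA).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(wav_path: str) -> torch.Tensor:
    """Return a fixed-size utterance embedding by mean-pooling frame outputs."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = model(**inputs).last_hidden_state  # shape: (1, T, D)
    return frames.mean(dim=1).squeeze(0)            # shape: (D,)

# wav_files and labels stand in for one of the emotion corpora (e.g. CREMA-D).
# X = torch.stack([embed(p) for p in wav_files]).numpy()
# clf = XGBClassifier(n_estimators=300, max_depth=6)
# clf.fit(X, labels)  # Random Forest or an FCN could be swapped in the same way
```

Because only the embedding model and the classifier vary, a comparative setup like the one described in the abstract amounts to repeating this loop over each PTM, dataset, and learning algorithm.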