Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Online access: | Order full text |
Summary: | Speech emotion recognition (SER) has drawn considerable attention
because of its applications in diverse fields. A current trend in SER methods
is to use embeddings from pre-trained models (PTMs) as input features
to downstream models. However, embeddings from speaker recognition
PTMs have received little attention compared with other PTM embeddings. To fill
this gap and to understand the efficacy of speaker recognition PTM
embeddings, we perform a comparative analysis of five PTM embeddings. Among
them, x-vector embeddings performed best, possibly because training for
speaker recognition leads them to capture various components of speech such as
tone and pitch. Our modeling approach, which uses x-vector embeddings and
mel-frequency cepstral coefficients (MFCC) as input features, is the most
lightweight approach while achieving accuracy comparable to previous
state-of-the-art (SOTA) methods on the CREMA-D benchmark. |
---|---|
DOI: | 10.48550/arxiv.2305.18640 |