Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Speech Self-Supervised Learning (SSL) has demonstrated considerable efficacy
in various downstream tasks. Nevertheless, prevailing self-supervised models
often overlook emotion-related prior information, thereby
neglecting the potential of emotion prior knowledge in speech to improve
performance on emotion-related tasks. In this paper, we propose an emotion-aware
speech representation learning method with intensity knowledge. Specifically, we
extract frame-level emotion intensities using an established speech-emotion
understanding model. Subsequently, we propose a novel emotional masking
strategy (EMS) to incorporate emotion intensities into the masking process. We
selected two representative models based on Transformer and CNN, namely
MockingJay and Non-autoregressive Predictive Coding (NPC), and conducted
experiments on the IEMOCAP dataset. Experiments demonstrate that the
representations derived from our proposed method outperform those of the original models
on the speech emotion recognition (SER) task. |
---|---|
DOI: | 10.48550/arxiv.2406.06646 |