Multi-Level Knowledge Distillation for Speech Emotion Recognition in Noisy Conditions
| Main authors: | , , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Order full text |
Abstract: Speech emotion recognition (SER) performance deteriorates significantly in the presence of noise, making it challenging to achieve competitive performance in noisy conditions. To this end, we propose a multi-level knowledge distillation (MLKD) method, which aims to transfer knowledge from a teacher model trained on clean speech to a simpler student model trained on noisy speech. Specifically, we use clean speech features extracted by wav2vec-2.0 as the learning goal and train distil wav2vec-2.0 to approximate the feature extraction ability of the original wav2vec-2.0 under noisy conditions. Furthermore, we leverage the multi-level knowledge of the original wav2vec-2.0 to supervise the single-level output of distil wav2vec-2.0. We evaluate the effectiveness of the proposed method through extensive experiments with five types of noise-contaminated speech on the IEMOCAP dataset, which show promising results compared to state-of-the-art models.
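The abstract describes a clean-teacher-to-noisy-student setup in which several teacher layers jointly supervise the student's single output. Below is a minimal PyTorch sketch of that multi-level distillation idea; the ToyEncoder stand-ins, the per-layer linear projections, the plain MSE objective, and the layer counts are illustrative assumptions, not the paper's actual wav2vec-2.0 / distil wav2vec-2.0 configuration.

```python
# Hedged sketch: multiple teacher layers (clean speech) supervise the single
# final layer of a smaller student (noisy speech). Toy encoders stand in for
# wav2vec-2.0 / distil wav2vec-2.0; layer counts, projection heads, and the
# MSE objective are assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Stand-in transformer stack that returns every layer's hidden states."""

    def __init__(self, num_layers, dim=256):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )

    def forward(self, x):
        hidden_states = []
        for layer in self.layers:
            x = layer(x)
            hidden_states.append(x)
        return hidden_states  # list of (batch, time, dim) tensors


class MLKDLoss(nn.Module):
    """Match each teacher layer to the student's single (last-layer) output."""

    def __init__(self, num_teacher_layers, dim=256):
        super().__init__()
        # One linear projection per teacher layer (an assumed design choice).
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_teacher_layers)])
        self.mse = nn.MSELoss()

    def forward(self, teacher_layers, student_output):
        losses = [self.mse(self.proj[i](t.detach()), student_output)
                  for i, t in enumerate(teacher_layers)]
        return torch.stack(losses).mean()


# Usage: the teacher sees clean speech features, the student the noisy version.
teacher = ToyEncoder(num_layers=12).eval()      # frozen, trained on clean speech
student = ToyEncoder(num_layers=4)              # smaller "distil" student
criterion = MLKDLoss(num_teacher_layers=12)

clean = torch.randn(2, 50, 256)                 # (batch, frames, feature dim)
noisy = clean + 0.1 * torch.randn_like(clean)   # toy noise contamination

with torch.no_grad():
    teacher_layers = teacher(clean)             # multi-level teacher knowledge
student_output = student(noisy)[-1]             # single-level student output
loss = criterion(teacher_layers, student_output)
loss.backward()
```

Detaching the teacher outputs keeps the gradient flowing only into the student and the projection heads, and averaging the per-layer losses lets every teacher level contribute equally to the single student target.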
DOI: 10.48550/arxiv.2312.13556