Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings
Format: Article
Language: English
Abstract: The adoption of advanced deep learning architectures in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets. To this end, this work introduces the application of speech embeddings extracted from pre-trained deep learning models trained on large audio datasets for different tasks. In particular, we explore audio representations obtained using the emphasized channel attention, propagation, and aggregation time delay neural network (ECAPA-TDNN) and Wav2Vec2.0 models, trained on the VoxCeleb and LibriSpeech datasets respectively. After extracting the embeddings, we benchmark several traditional classifiers, such as K-nearest neighbors (KNN), Gaussian naive Bayes, and a neural network, on the SD task. Compared to standard SD systems trained only on the limited SEP-28k dataset, we obtain relative improvements of 12.08%, 28.71%, and 37.9% in unweighted average recall (UAR) over the baselines. Finally, we show that combining the two embeddings and concatenating multiple layers of Wav2Vec2.0 can further improve UAR by up to 2.60% and 6.32% respectively.
DOI: 10.48550/arxiv.2306.00689
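
The record contains no code, but the embedding-extraction step described in the abstract can be sketched with publicly available checkpoints. This is a minimal illustration only: the SpeechBrain speaker model `speechbrain/spkrec-ecapa-voxceleb` and the HuggingFace checkpoint `facebook/wav2vec2-base` are assumed stand-ins, and the resampling, pooling, and exact checkpoints used in the paper may differ.

```python
# Sketch: extract utterance-level embeddings from pre-trained models.
# Checkpoints and pooling are illustrative assumptions, not the paper's exact setup.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

wav, sr = torchaudio.load("clip.wav")                  # a short speech clip
wav = torchaudio.functional.resample(wav, sr, 16000)   # both models expect 16 kHz

# ECAPA-TDNN speaker embedding (trained on VoxCeleb).
ecapa = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
ecapa_emb = ecapa.encode_batch(wav).squeeze()          # shape: (192,)

# Wav2Vec2.0 contextual embedding (pre-trained on LibriSpeech).
fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
w2v = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base",
                                    output_hidden_states=True)
inputs = fe(wav.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = w2v(**inputs)
# Mean-pool the frame-level features over time to get one vector per clip.
w2v_emb = out.last_hidden_state.mean(dim=1).squeeze()  # shape: (768,)
```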
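Given fixed embeddings, benchmarking the classifiers named in the abstract reduces to standard scikit-learn usage. The sketch below uses random placeholder data in place of the SEP-28k embeddings and labels (the names `X`, `y`, the split, and the classifier hyperparameters are all illustrative); UAR is computed as macro-averaged recall.

```python
# Sketch: benchmark classical classifiers on pre-extracted embeddings.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 960))    # placeholder embedding matrix (n_clips, dim)
y = rng.integers(0, 5, size=400)   # placeholder disfluency-type labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

for name, clf in [
    ("KNN", KNeighborsClassifier(n_neighbors=5)),
    ("Gaussian NB", GaussianNB()),
    ("Neural network", MLPClassifier(hidden_layer_sizes=(256,), max_iter=500)),
]:
    clf.fit(X_tr, y_tr)
    # Unweighted average recall (UAR) = recall averaged equally over classes.
    uar = recall_score(y_te, clf.predict(X_te), average="macro")
    print(f"{name}: UAR = {uar:.4f}")
```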
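The two fusion strategies the abstract reports gains for (combining both embeddings, and concatenating multiple Wav2Vec2.0 layers) amount to feature-level concatenation. Continuing the extraction sketch above (`ecapa_emb`, `w2v_emb`, `out`), a possible reading is shown below; which hidden layers to concatenate is a hyperparameter, and the indices here are only an example, not the paper's choice.

```python
# Sketch: feature-level fusion by concatenation.
import torch

# (a) Combine the two embeddings into one feature vector.
fused = torch.cat([ecapa_emb, w2v_emb], dim=-1)        # (192 + 768,) = (960,)

# (b) Concatenate several Wav2Vec2.0 layers: out.hidden_states is a tuple of
# 13 tensors (CNN output + 12 transformer layers), each of shape (1, T, 768).
layers = [out.hidden_states[i].mean(dim=1) for i in (4, 8, 12)]  # example indices
multi_layer = torch.cat(layers, dim=-1).squeeze()      # (3 * 768,) = (2304,)
```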