The Effect of Spoken Language on Speech Enhancement using Self-Supervised Speech Representation Loss Functions
Recent work in the field of speech enhancement (SE) has involved the use of self-supervised speech representations (SSSRs) as feature transformations in loss functions. However, in prior work, very little attention has been paid to the relationship between the language of the audio used to train the...
Gespeichert in:
Hauptverfasser: | , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Recent work in the field of speech enhancement (SE) has involved the use of
self-supervised speech representations (SSSRs) as feature transformations in
loss functions. However, in prior work, very little attention has been paid to
the relationship between the language of the audio used to train the
self-supervised representation and that used to train the SE system.
Enhancement models trained using a loss function which incorporates a
self-supervised representation that shares exactly the language of the noisy
data used to train the SE system show better performance than those which do
not match exactly. This may lead to enhancement systems which are language
specific and as such do not generalise well to unseen languages, unlike models
trained using traditional spectrogram or time domain loss functions. In this
work, SE models are trained and tested on a number of different languages, with
self-supervised representations which themselves are trained using different
language combinations and with differing network structures as loss function
representations. These models are then tested across unseen languages and their
performances are analysed. It is found that the training language of the
self-supervised representation appears to have a minor effect on enhancement
performance, the amount of training data of a particular language, however,
greatly affects performance. |
---|---|
DOI: | 10.48550/arxiv.2307.14502 |