Efficient Personalized Speech Enhancement Through Self-Supervised Learning
| Published in: | IEEE Journal of Selected Topics in Signal Processing, 2022-10, Vol. 16 (6), pp. 1342-1356 |
| --- | --- |
| Main authors: | , |
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Order full text |
Abstract:

This work presents self-supervised learning methods for monaural speaker-specific (i.e., personalized) speech enhancement models. While general-purpose models must broadly address many speakers, personalized models can adapt to a particular speaker's voice, since they are expected to solve a narrower problem. Hence, personalization can achieve better performance in addition to reducing computational complexity. However, naive personalization methods require clean speech from the target user, which can be inconvenient to obtain, e.g., due to subpar recording conditions. To this end, we pose personalization as either a zero-shot task, in which no clean speech of the target speaker is used, or a few-shot learning task, which aims to minimize the duration of clean speech used for transfer learning. In this paper, we propose self-supervised learning methods as a solution to both the zero- and few-shot personalization tasks. The proposed methods learn personalized speech features from unlabeled data (i.e., in-the-wild noisy recordings from the target user) rather than from clean sources. We investigate three different self-supervised learning mechanisms. We set up a pseudo speech enhancement problem as a pretext task, which pretrains the models to estimate noisy speech as if it were the clean target. Contrastive learning and data purification methods regularize the loss function of the pseudo enhancement problem, overcoming the limitations of learning from unlabeled data. We assess our methods by personalizing the well-known ConvTasNet architecture to twenty different target speakers. The results show that self-supervision-based personalization improves the original ConvTasNet's enhancement quality with fewer model parameters and less clean data from the target user.
ISSN: 1932-4553, 1941-0484
DOI: 10.1109/JSTSP.2022.3181782
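The abstract describes a pseudo speech enhancement pretext task in which a model is trained to recover the target user's in-the-wild noisy recording from a further-degraded mixture. The sketch below illustrates that idea in PyTorch under stated assumptions: it is not the authors' implementation, and `TinyEnhancer`, `si_snr_loss`, `pseudo_se_step`, the 5 dB pseudo-SNR, and the optional `purify_weights` argument are all illustrative choices; the per-utterance weighting is only a simplified stand-in for the paper's contrastive-learning and data-purification regularizers, and the toy network stands in for the slimmed ConvTasNet models used in the paper.

```python
# Minimal sketch of a pseudo speech enhancement pretext task (illustrative, not the paper's code).
import torch
import torch.nn as nn


def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR (negated so lower is better), averaged over the batch."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target to split it into "signal" and "error" components.
    scale = (estimate * target).sum(dim=-1, keepdim=True) / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10((s_target.pow(2).sum(dim=-1) + eps) / (e_noise.pow(2).sum(dim=-1) + eps))
    return -si_snr.mean()


class TinyEnhancer(nn.Module):
    """Toy stand-in for a compact personalized enhancement model (the paper personalizes ConvTasNet)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8, padding=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time) waveform; lengths divisible by 8 keep input and output lengths equal.
        return self.net(x.unsqueeze(1)).squeeze(1)


def pseudo_se_step(model, optimizer, noisy_speech, extra_noise, snr_db=5.0, purify_weights=None):
    """One pretext-task step: recover the user's noisy recording from an even noisier mixture.

    noisy_speech   -- (B, T) in-the-wild recordings of the target user, used as pseudo targets
    extra_noise    -- (B, T) additional noise mixed in at `snr_db` to form the model input
    purify_weights -- optional (B,) weights that down-weight segments dominated by noise
                      (a simplified stand-in for the paper's data purification idea)
    """
    # Scale the extra noise so the mixture sits at the requested pseudo-SNR relative to the recording.
    sig_pow = noisy_speech.pow(2).mean(dim=-1, keepdim=True)
    noise_pow = extra_noise.pow(2).mean(dim=-1, keepdim=True) + 1e-8
    gain = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    mixture = noisy_speech + gain * extra_noise

    estimate = model(mixture)
    if purify_weights is None:
        loss = si_snr_loss(estimate, noisy_speech)
    else:
        # Weighted variant: per-utterance losses scaled by the purification weights.
        per_utt = torch.stack([si_snr_loss(e.unsqueeze(0), t.unsqueeze(0))
                               for e, t in zip(estimate, noisy_speech)])
        loss = (purify_weights * per_utt).sum() / (purify_weights.sum() + 1e-8)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = TinyEnhancer()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    noisy = torch.randn(4, 16000)  # placeholder for the target user's noisy recordings
    noise = torch.randn(4, 16000)  # placeholder for additional interfering noise
    print(pseudo_se_step(model, opt, noisy, noise))
```

In the zero-shot setting described in the abstract, a model pretrained this way would be used without any clean speech from the target user; in the few-shot setting, its weights would serve as the initialization for transfer learning on whatever small amount of clean speech the user can provide.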