Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting
Abstract: Most speech self-supervised learning (SSL) models are trained with a pretext
task which consists of predicting missing parts of the input signal, either
future segments (causal prediction) or segments masked anywhere within the
input (non-causal prediction). Learned speech representations can then be
efficiently transferred to downstream tasks (e.g., automatic speech or speaker
recognition). In the present study, we investigate the use of a speech SSL
model for speech inpainting, that is, reconstructing a missing portion of a
speech signal from its surrounding context, i.e., fulfilling a downstream task
that is very similar to the pretext task. To this end, we combine an SSL
encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role
of a decoder. In particular, we propose two solutions to match the HuBERT
output with the HiFiGAN input: freezing one and fine-tuning the other, and
vice versa. The performance of both approaches was assessed in single- and
multi-speaker settings, for both informed and blind inpainting configurations
(i.e., with the position of the mask known or unknown, respectively), using
different objective metrics and a perceptual evaluation. Results show that
while both solutions can correctly reconstruct signal portions up to 200 ms
long (and even 400 ms in some cases), fine-tuning the SSL encoder yields a
more accurate signal reconstruction in the single-speaker setting, whereas
freezing it (and training the neural vocoder instead) is a better strategy
when dealing with multi-speaker data.
DOI: 10.48550/arxiv.2405.20101
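As a rough illustration of the encoder-decoder setup the abstract describes, below is a minimal PyTorch sketch, not the authors' code: it assumes the transformers HubertModel as the SSL encoder and uses a generic torch.nn.Module stand-in for the HiFiGAN vocoder; the paper's actual layers matching the HuBERT output to the HiFiGAN input are not reproduced here, and the `inpaint` helper is hypothetical.

```python
import torch
from transformers import HubertModel

# SSL encoder from the abstract; "facebook/hubert-base-ls960" is a standard
# public HuBERT checkpoint, not necessarily the one used in the paper.
encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")


def inpaint(waveform: torch.Tensor, mask: torch.Tensor,
            vocoder: torch.nn.Module, finetune_encoder: bool = True) -> torch.Tensor:
    """Reconstruct the masked span of `waveform` (shape [1, T]) from context.

    `mask` is a boolean tensor of shape [1, T], True over the missing span
    (the "informed" configuration, where the mask position is known).
    `vocoder` stands in for HiFiGAN and is assumed to map frame-level
    features of shape [1, frames, hidden] to a waveform.
    """
    # Zero out the missing portion so HuBERT only sees the surrounding context.
    corrupted = waveform.masked_fill(mask, 0.0)

    # Frame-level representations from the SSL encoder.
    feats = encoder(corrupted).last_hidden_state  # [1, frames, hidden]

    # The two strategies from the abstract, in simplified form:
    # - fine-tuned encoder: gradients flow into HuBERT during training;
    # - frozen encoder: features are detached and only the vocoder is trained.
    if not finetune_encoder:
        feats = feats.detach()

    # The neural vocoder plays the role of the decoder, synthesising a full
    # waveform (including the previously missing span) from the features.
    return vocoder(feats)
```

In the blind configuration described in the abstract, `mask` would not be available at inference time; the corrupted waveform would be fed to the encoder as-is.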