DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning
Saved in:
Main authors: , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Most text-to-speech (TTS) methods use high-quality speech corpora
recorded in a well-designed environment, incurring a high cost for data
collection. To solve this problem, existing noise-robust TTS methods aim to
use noisy speech corpora as training data. However, they address only either
time-invariant or time-variant noise. We propose a degradation-robust TTS
method that can be trained on speech corpora containing both additive noise
and environmental distortion. It jointly represents the time-variant additive
noise with a frame-level encoder and the time-invariant environmental
distortion with an utterance-level encoder. We also propose a regularization
method to obtain a clean environmental embedding that is disentangled from
utterance-dependent information such as linguistic content and speaker
characteristics. Evaluation results show that our method achieves
significantly higher-quality synthetic speech than previous methods under
conditions including both additive noise and reverberation.
DOI: 10.48550/arxiv.2203.15683
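The summary distinguishes two acoustic representations: a frame-level one (one vector per frame, capturing time-variant additive noise) and an utterance-level one (a single vector per utterance, capturing time-invariant environmental distortion). The sketch below illustrates that distinction with plain linear projections and mean pooling over frames; it is a minimal, hypothetical illustration only, and the paper's actual encoders are learned neural networks whose details the record does not give.

```python
import numpy as np

def frame_level_embedding(mel, W_f):
    """One embedding per frame: captures time-variant information
    such as additive noise. mel: (T, D), W_f: (D, E) -> (T, E)."""
    return mel @ W_f

def utterance_level_embedding(mel, W_u):
    """One embedding per utterance: mean-pool over frames so only
    time-invariant information (e.g. environmental distortion)
    survives, then project. mel: (T, D), W_u: (D, E) -> (E,)."""
    pooled = mel.mean(axis=0)  # (D,) average over the time axis
    return pooled @ W_u

# Toy input: 100 frames of an 80-bin mel spectrogram (random stand-in).
rng = np.random.default_rng(0)
mel = rng.standard_normal((100, 80))
W_f = rng.standard_normal((80, 16))
W_u = rng.standard_normal((80, 16))

noise_emb = frame_level_embedding(mel, W_f)    # shape (100, 16): varies per frame
env_emb = utterance_level_embedding(mel, W_u)  # shape (16,): one per utterance
```

The key design point mirrored here is the granularity mismatch: the frame-level representation keeps the time axis, while pooling in the utterance-level path deliberately discards it.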