Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data
Format: Article
Language: English
Online access: Order full text
Summary: Collecting high-quality studio recordings of audio is challenging, which limits the language coverage of text-to-speech (TTS) systems. This paper proposes a framework for scaling a multilingual TTS model to 100+ languages using found data without supervision. The proposed framework combines speech-text encoder pretraining with unsupervised training using untranscribed speech and unspoken text data sources, thereby leveraging massively multilingual joint speech and text representation learning. Without any transcribed speech in a new language, this TTS model can generate intelligible speech in >30 unseen languages (CER difference of ...
DOI: 10.48550/arxiv.2402.18932
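
The core idea described in the abstract, pretraining a shared speech-text encoder on untranscribed speech and unspoken text alongside a small paired set, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture or training recipe; the class name, dimensions, masking rates, and the MSE-based alignment term are assumptions made for exposition only.

```python
# Minimal sketch of joint speech-text representation learning with a shared
# encoder, trained on (1) a small paired set, (2) untranscribed speech, and
# (3) unspoken text. All names, sizes, and losses are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointSpeechTextEncoder(nn.Module):
    def __init__(self, vocab_size=512, speech_dim=80, d_model=256, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # unspoken-text path
        self.speech_proj = nn.Linear(speech_dim, d_model)     # untranscribed-speech path
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # shared across modalities
        self.speech_head = nn.Linear(d_model, speech_dim)      # reconstruct masked frames
        self.text_head = nn.Linear(d_model, vocab_size)        # predict masked tokens

    def encode_text(self, tokens):
        return self.encoder(self.text_embed(tokens))

    def encode_speech(self, frames):
        return self.encoder(self.speech_proj(frames))


def training_step(model, paired, speech_only, text_only):
    """One combined step over paired, speech-only, and text-only batches."""
    frames, tokens = paired
    # (1) Pull pooled speech/text embeddings of the same utterance together.
    align = F.mse_loss(model.encode_speech(frames).mean(1),
                       model.encode_text(tokens).mean(1))

    # (2) Masked-frame reconstruction on untranscribed speech.
    mask = torch.rand(speech_only.shape[:2], device=speech_only.device) < 0.3
    corrupted = speech_only.masked_fill(mask.unsqueeze(-1), 0.0)
    rec = model.speech_head(model.encode_speech(corrupted))
    speech_loss = F.mse_loss(rec[mask], speech_only[mask])

    # (3) Masked language modelling on unspoken text (id 0 reserved as [MASK]).
    tmask = torch.rand(text_only.shape, device=text_only.device) < 0.15
    corrupted_t = text_only.masked_fill(tmask, 0)
    logits = model.text_head(model.encode_text(corrupted_t))
    text_loss = F.cross_entropy(logits[tmask], text_only[tmask])

    return align + speech_loss + text_loss


if __name__ == "__main__":
    model = JointSpeechTextEncoder()
    paired = (torch.randn(2, 120, 80), torch.randint(1, 512, (2, 40)))
    loss = training_step(model, paired,
                         torch.randn(2, 200, 80), torch.randint(1, 512, (4, 60)))
    loss.backward()
    print(float(loss))
```

In this sketch, only the third branch requires transcribed (paired) data; the speech-only and text-only branches are what would let such a model absorb found audio and raw text in languages with no transcriptions, which is the scaling mechanism the abstract emphasizes.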