Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation
Saved in:
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Several solutions for lightweight TTS have shown promising results. Still, they either rely on hand-crafted designs that reach a non-optimal size or use neural architecture search, which often incurs high training costs. We present Nix-TTS, a lightweight TTS obtained via knowledge distillation from a high-quality yet large, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we propose module-wise distillation, which enables flexible and independent distillation of the encoder and decoder modules. The resulting Nix-TTS inherits the teacher's advantageous non-autoregressive and end-to-end properties, yet is significantly smaller, with only 5.23M parameters, an 89.34% reduction from the teacher model; it also achieves over 3.04x and 8.36x inference speedups on an Intel i7 CPU and a Raspberry Pi 3B, respectively, while retaining fair voice naturalness and intelligibility compared to the teacher model. We provide pretrained models and audio samples of Nix-TTS. |
DOI: | 10.48550/arxiv.2203.15643 |
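
The core idea the abstract describes, module-wise distillation, amounts to training each student module to reproduce the output of the corresponding frozen teacher module, independently of the other modules. Below is a minimal PyTorch sketch of one such training step; the module definitions, dimensions, and MSE objective are illustrative assumptions for this sketch, not the actual Nix-TTS architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for one teacher/student module pair; the real
# Nix-TTS encoder and decoder (and their teacher) are different and larger.
teacher_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
student_encoder = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 80))

optimizer = torch.optim.Adam(student_encoder.parameters(), lr=1e-4)

def distill_step(teacher, student, batch):
    """One module-wise distillation step: the student module learns to
    match the frozen teacher module's output on the same input, with no
    dependence on the other modules. MSE is an assumed objective here."""
    with torch.no_grad():
        target = teacher(batch)          # frozen teacher output
    loss = F.mse_loss(student(batch), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: distill the encoder on random feature batches.
for _ in range(3):
    batch = torch.randn(16, 80)
    distill_step(teacher_encoder, student_encoder, batch)
```

Because each module pair is trained against its own teacher in isolation, the encoder and decoder distillations can run with different losses, schedules, or even at different times, which is the flexibility the abstract highlights.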