TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers
Main Authors: , , , , ,
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Neural codec language models (LMs) have demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, codec LMs often suffer from limited inference speed and stability, due to their auto-regressive nature and the implicit alignment between text and audio. In this work, to address these challenges, we introduce a new variant of the neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve training and inference efficiency and to reduce model size. Meanwhile, an additional gated cross-attention layer is included in each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In evaluations on the LibriSpeech corpus, the proposed TacoLM achieves a better word error rate, speaker similarity, and mean opinion score than VALL-E, with 90% fewer parameters and a 5.2x speedup. Demo and code are available at https://ereboas.github.io/TacoLM/.
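
The abstract names two architectural ingredients: a gated attention mechanism in the decoder, and an additional gated cross-attention layer in each decoder layer that attends to the text. As a rough illustration of how such gating can be wired, here is a minimal PyTorch sketch. It assumes sigmoid gates applied to the attention outputs before the residual connection; all class names, dimensions, and the overall layer layout are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Self-attention whose output is modulated by a learned sigmoid gate
    before the residual connection (an assumed, simplified gating scheme)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, x, attn_mask=None):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        return x + self.gate(h) * out

class GatedCrossAttention(nn.Module):
    """Cross-attention from audio-token states to encoded text, gated so the
    model controls how much text conditioning flows into each position."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, x, text):
        h = self.norm(x)
        out, _ = self.attn(h, text, text, need_weights=False)
        return x + self.gate(h) * out

class DecoderLayer(nn.Module):
    """One decoder layer: gated self-attention, then gated cross-attention
    to the text, then a feed-forward block (pre-norm, residual throughout)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = GatedAttention(d_model, n_heads)
        self.cross_attn = GatedCrossAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, audio, text, attn_mask=None):
        audio = self.self_attn(audio, attn_mask)
        audio = self.cross_attn(audio, text)  # explicit text conditioning per layer
        return audio + self.ffn(audio)

# Usage with random tensors: 2 utterances, 100 audio tokens, 30 text tokens.
layer = DecoderLayer()
audio = torch.randn(2, 100, 512)
text = torch.randn(2, 30, 512)
out = layer(audio, text)  # -> shape (2, 100, 512)
```

Injecting the text through a cross-attention layer in every decoder block, rather than relying only on an implicit text-audio alignment in a single prefix, is one way such a design can improve the content accuracy the abstract reports.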
DOI: 10.48550/arxiv.2406.15752