Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Bibliographic details
Published in: arXiv.org 2024-06
Main authors: Anastassiou, Philip; Chen, Jiawei; Chen, Jitong; Chen, Yuanzhe; Chen, Zhuo; Chen, Ziyi; Cong, Jian; Deng, Lelai; Chuang, Ding; Gao, Lu; Gong, Mingqing; Huang, Peisong; Huang, Qingqing; Huang, Zhiying; Huo, Yuanyuan; Jia, Dongya; Li, Chumin; Li, Feiya; Li, Hui; Li, Jiaxin; Li, Xiaoyang; Li, Xingxing; Liu, Lin; Liu, Shouda; Liu, Sichao; Liu, Xudong; Liu, Yuchen; Liu, Zhengxi; Lu, Lu; Pan, Junjie; Wang, Xin; Wang, Yuping; Wang, Yuxuan; Wei, Zhen; Wu, Jian; Yao, Chao; Yang, Yifeng; Yuanhao Yi; Zhang, Junteng; Zhang, Qidi; Zhang, Shuo; Zhang, Wenjie; Zhang, Yang; Zhao, Zilin; Zhong, Dejian; Zhuang, Xiaobin
Format: Article
Language: eng
Online access: Full text
Description
Summary: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named \(\text{Seed-TTS}_{\text{DiT}}\), which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, \(\text{Seed-TTS}_{\text{DiT}}\) does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.
ISSN:2331-8422