Indonesian Voice Cloning Text-to-Speech System With Vall-E-Based Model and Speech Enhancement


Bibliographic Details
Published in: IEEE Access 2024, Vol. 12, p. 193131-193140
Main authors: Raditya Pratama Roosadi, Hizkia; Rivai Ginanjar, Rizki; Puji Lestari, Dessi
Format: Article
Language: English
Description
Abstract: In recent years, Text-to-Speech (TTS) technology has advanced, with research focusing on multi-speaker TTS capable of voice cloning. In 2023, Wang et al. introduced Vall-E, a Transformer-based neural codec language model, achieving state-of-the-art results in voice cloning. However, limited research has applied such models to the Indonesian language, leaving room for improvement in speech synthesis. This paper explores the development of a TTS system using Vall-E and investigates enhancements to the synthesized speech. The dataset, comprising audio-transcript pairs, was sourced from previous Indonesian speech processing research. Data preparation involved converting audio into codec tokens and transcripts into phoneme tokens. Following Wang et al., a neural codec language model was built and trained using open-source tools. Additionally, this paper explores the integration of the VoiceFixer tool for speech enhancement. The inclusion of VoiceFixer improved the naturalness MOS from 3.34 to 3.95, demonstrating its effectiveness in enhancing speech quality. Overall, the TTS system achieved a naturalness MOS of 3.489 and a similarity MOS of 3.521, with a WER of 19.71%; speaker similarity was also examined through a visualization of speaker embedding vectors. These results indicate that the Vall-E model can produce Indonesian speech with high speaker similarity. The development also emphasizes the importance of factors like the number of speakers, data selection, processing components, modeling, and speech duration during training for synthesis quality.
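The abstract reports a word error rate (WER) without defining it. As a rough illustration only, WER is commonly computed as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below assumes that standard definition; the function name and the example sentences are hypothetical, and the record does not say which tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences for illustration, not taken from the paper's test set.
reference = "saya pergi ke pasar pagi ini"
hypothesis = "saya pergi pasar pagi itu"
print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

In this example one reference word is dropped and one is substituted, so two errors over six reference words give a WER of about 33%. For TTS evaluation, the hypothesis would typically come from running a speech recognizer on the synthesized audio.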
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3519870