VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Despite the significant advancements in Text-to-Speech (TTS) systems, their
full utilization in automatic dubbing remains limited. This task necessitates
the extraction of voice identity and emotional style from a reference speech in
a source language and subsequently transferring them to a target language using
cross-lingual TTS techniques. While previous approaches have mainly
concentrated on controlling voice identity within the cross-lingual TTS
framework, there has been limited work on incorporating emotion and voice
identity together. To this end, we introduce an end-to-end Voice Identity and
Emotional Style Controllable Cross-Lingual (VECL) TTS system using multilingual
speakers and an emotion embedding network. Moreover, we introduce content and
style consistency losses to enhance the quality of synthesized speech further.
The proposed system achieved an average relative improvement of 8.83% compared
to the state-of-the-art (SOTA) methods on a database comprising English and
three Indian languages (Hindi, Telugu, and Marathi). |
---|---|
DOI: | 10.48550/arxiv.2406.08076 |