Building Synthetic Speaker Profiles in Text-to-Speech Systems
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | The diversity of speaker profiles in multi-speaker TTS systems is a crucial
aspect of their performance, as it measures how many different speaker profiles
a TTS system can synthesize. However, this aspect is often overlooked when
building multi-speaker TTS systems, and there is no established framework for
evaluating this diversity. The reason is that most multi-speaker TTS systems
are limited to generating speech signals with the same speaker profiles as
their training data. They often use discrete speaker embedding vectors that
have a one-to-one correspondence with individual speakers. This correspondence
hinders their ability to generate unseen speaker profiles that did not appear
during training. In this paper, we aim to build multi-speaker TTS systems that
have a greater variety of speaker profiles and can generate new synthetic
speaker profiles that differ from those in the training data. To this end, we
propose to use generative models with a triplet loss and a specific shuffle
mechanism. In our experiments, the effectiveness and advantages of the proposed
method are demonstrated in terms of both the distinctiveness and
intelligibility of the synthesized speech signals. |
---|---|
DOI: | 10.48550/arxiv.2202.03125 |
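
The summary names a triplet loss without giving its form. As a rough illustration only, the sketch below shows the standard margin-based triplet loss on toy embedding vectors. The function names, the Euclidean distance, the margin value, and the example vectors are all assumptions made for this illustration and are not taken from the paper; the paper's generative model and shuffle mechanism are not reproduced here.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. The margin of 1.0 is an assumed default.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D "speaker embeddings" (made-up values, for illustration only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # assumed to come from the same speaker as the anchor
far_negative = [3.0, 4.0]    # a different speaker, far from the anchor
near_negative = [0.0, 1.5]   # a different speaker, close to the anchor

loss_far = triplet_loss(anchor, positive, far_negative)
loss_near = triplet_loss(anchor, positive, near_negative)

print(loss_far, loss_near)
```

In this toy setup the far negative already satisfies the margin, so it contributes no loss, while the near negative is penalized. This is the sense in which a triplet loss pushes embeddings of different speakers apart.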