Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion
Main authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Keywords: | |
Abstract: | Voice conversion (VC) aims to alter a person's voice so that it sounds
like the voice of another person while preserving the linguistic content.
Existing methods face a trade-off between content intelligibility and
speaker similarity: methods with higher intelligibility usually have
lower speaker similarity, while methods with higher speaker similarity usually
require plenty of target speaker voice data to achieve high intelligibility. In
this work, we propose a novel method, Phoneme Hallucinator, that
achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model;
it adopts a novel model to hallucinate diversified and high-fidelity target
speaker phonemes from only a short target speaker voice sample (e.g., 3 seconds).
The hallucinated phonemes are then exploited to perform neighbor-based voice
conversion. Our model is a text-free, any-to-any VC model that requires no text
annotations and supports conversion to any unseen speaker. Objective and
subjective evaluations show that Phoneme Hallucinator outperforms
existing VC methods in both intelligibility and speaker similarity. |
---|---|
DOI: | 10.48550/arxiv.2308.06382 |
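
The neighbor-based conversion step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_convert`, the use of cosine similarity, the value `k=4`, the feature dimension, and the random arrays standing in for real and hallucinated target features are all assumptions made for illustration. The sketch only shows the general idea that each source frame is replaced by an average of its nearest frames in a target speaker set, and that expanding that set gives the search more candidates.

```python
import numpy as np


def knn_convert(source_feats, target_set, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    source_feats: (n_frames, dim) array of source-utterance features.
    target_set:   (n_target, dim) array of target-speaker features.
    Cosine similarity and k=4 are illustrative assumptions, not taken
    from the abstract.
    """
    k = min(k, len(target_set))
    # L2-normalise rows so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_set / np.linalg.norm(target_set, axis=1, keepdims=True)
    sim = src @ tgt.T                         # (n_frames, n_target)
    # Indices of the k most similar target frames for each source frame.
    top_k = np.argsort(-sim, axis=1)[:, :k]   # (n_frames, k)
    # Average the selected (unnormalised) target frames.
    return target_set[top_k].mean(axis=1)     # (n_frames, dim)


# Random stand-ins: real features would come from a speech encoder, and the
# extra target features from a generative model (both hypothetical here).
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 16))
real_target = rng.normal(size=(150, 16))      # small set from a short sample
hallucinated = rng.normal(size=(1000, 16))    # placeholder for generated features
expanded = np.concatenate([real_target, hallucinated], axis=0)
converted = knn_convert(source, expanded, k=4)
```

In this sketch the output has one converted frame per source frame, and every output frame is built only from target set entries, which is why the size and diversity of the target set matter for the result.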