Voice Attribute Editing with Text Prompt
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Despite recent advances in speech generation, where text prompts provide
control over speech style, voice attributes in synthesized speech remain
elusive and difficult to control. This paper introduces a novel task: voice
attribute editing with a text prompt, with the goal of making relative
modifications to voice attributes according to the actions described in the
text prompt. To solve this task, VoxEditor, an end-to-end generative model, is
proposed. To address the insufficiency of text prompts, VoxEditor includes a
Residual Memory (ResMem) block that efficiently maps voice attributes and the
corresponding descriptors into a shared feature space. Additionally, the ResMem
block is enhanced with a voice attribute degree prediction (VADP) block that
aligns voice attributes with their corresponding descriptors, addressing the
imprecision of text prompts caused by non-quantitative descriptions of voice
attributes. We also establish the open-source VCTK-RVA dataset, which leads the
way in manual annotations detailing voice characteristic differences among
different speakers. Extensive experiments demonstrate the effectiveness and
generalizability of the proposed method in terms of both objective and
subjective metrics. The dataset and audio samples are available on the website. |
---|---|
DOI: | 10.48550/arxiv.2404.08857 |