SF-Speech: Straightened Flow for Zero-Shot Voice Clone on Small-Scale Dataset
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Large-scale speech generation models have achieved impressive performance on
zero-shot voice cloning tasks by relying on large-scale datasets. However,
exploring how to achieve zero-shot voice cloning with small-scale datasets is
also essential. This paper proposes SF-Speech, a novel state-of-the-art voice
cloning model based on ordinary differential equations and contextual learning.
Unlike previous work, SF-Speech employs a multi-stage generation strategy
to obtain a coarse acoustic feature and uses this feature to straighten
the curved reverse trajectories caused by training the ordinary differential
equation model with flow matching. In addition, we identify differences between
the local correlations of different types of acoustic features and demonstrate
the potential role of 2D convolution in modeling mel-spectrogram features.
After training on less than 1000 hours of speech, SF-Speech significantly
outperforms methods based on global speaker embeddings or autoregressive
large language models. In particular, SF-Speech also shows a significant
advantage over VoiceBox, the best-performing ordinary differential equation
model, in speech intelligibility (a relative decrease of 22.4% in word error
rate) and timbre similarity (a relative improvement of 5.6% in cosine
distance) at a similar parameter scale, and even keeps a slight advantage
when the parameters of VoiceBox are tripled. |
---|---|
DOI: | 10.48550/arxiv.2410.12399 |