End-to-end voice conversion model and training method and reasoning method thereof


Detailed Description

Bibliographic Details
Main Authors: WANG FEI, WANG HUANLIANG, WU TIANXIN
Format: Patent
Language: Chinese; English
Description
Abstract: The invention provides an end-to-end voice conversion model together with its training and inference methods. The model is based on a conditional variational autoencoder; the acoustic model and the vocoder are trained jointly, so that the mismatch between training and inference is avoided. A large-scale pre-trained HuBERT model is used to extract the content representation, which preliminarily strips speaker information from that representation and enriches its syllable initial and final information. Speaker information is further stripped from the content representation by a gradient reversal method, so that timbre leakage is avoided. A codebook quantization method simplifies the content representation and improves its timbre-stripping capability. In addition, a model distillation method based on KL divergence distills the computationally expensive content extractor into a student network with more efficient computation.
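The gradient reversal step mentioned in the abstract is a standard adversarial trick: a layer that is the identity in the forward pass but flips (and scales) the gradient in the backward pass, so the content extractor is pushed to produce features from which a speaker classifier cannot recover the speaker. The following is a minimal numpy sketch of that mechanism, not the patent's actual implementation; the class name and the scaling factor `lam` are illustrative.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; sign-flipped, scaled gradient in the
    backward pass (illustrative stand-in for an autograd-integrated layer)."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength, a hyperparameter

    def forward(self, x):
        # Activations pass through unchanged
        return x

    def backward(self, grad_output):
        # Upstream parameters receive the *negated* speaker-classifier
        # gradient, training them to remove speaker information
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
x = np.array([1.0, 2.0, 3.0])
y = grl.forward(x)             # identical to x
g = grl.backward(np.ones(3))   # [-0.5, -0.5, -0.5]
```

In a real model this layer sits between the content extractor and an auxiliary speaker classifier, so the two are trained adversarially within a single backward pass.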
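Codebook quantization, the second technique named in the abstract, replaces each content frame with its nearest entry in a learned codebook, which caps how much information (including residual timbre) the content representation can carry. A minimal nearest-neighbour sketch, with a made-up two-entry codebook purely for illustration:

```python
import numpy as np

def quantize(frames, codebook):
    """Replace each frame (N, D) with its nearest codebook vector (K, D).

    Returns the quantized frames and the chosen codebook indices.
    """
    # Pairwise squared Euclidean distances, shape (N, K)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)      # nearest codebook entry per frame
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0]])
frames = np.array([[0.1, -0.1],   # closest to entry 0
                   [0.9,  1.2]])  # closest to entry 1
quantized, idx = quantize(frames, codebook)
```

In training, the codebook itself would be learned (e.g. VQ-VAE style, with a straight-through gradient estimator); the lookup above shows only the inference-time behaviour.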
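Finally, the KL-divergence-based distillation the abstract describes trains a lightweight student to match the output distribution of the heavy teacher content extractor. A sketch of the loss under the usual temperature-softened formulation; the logit values and temperature are illustrative, not taken from the patent:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over a 1-D logit vector."""
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def kl_div(p, q):
    """KL(p || q) between two discrete distributions; zero iff p == q."""
    return float((p * np.log(p / q)).sum())

teacher_logits = np.array([2.0, 0.5, -1.0])
student_logits = np.array([1.8, 0.6, -0.9])

T = 2.0  # higher temperature exposes more of the teacher's "dark knowledge"
loss = kl_div(softmax(teacher_logits, T), softmax(student_logits, T))
# loss is non-negative and shrinks as the student's distribution
# approaches the teacher's
```

Minimizing this loss over the student's parameters transfers the teacher's behaviour into a network cheap enough for real-time inference.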