Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data

Bibliographic details
Published in: Speech Communication, 2014-03, Vol. 58 (Mar), p. 124-138
Main authors: Xu, Ning; Tang, Yibing; Bao, Jingyi; Jiang, Aiming; Liu, Xiaofeng; Yang, Zhen
Format: Article
Language: English
Online access: Full text
Description
Summary:
• Full implementation details of the GP-based method are described in depth.
• Coherent training that combines excitation and vocal tract features is proposed.
• Asymmetric training is proposed to increase accuracy without additional cost.

Voice conversion (VC) is a technique that aims to map the individuality of a source speaker to that of a target speaker, and Gaussian mixture model (GMM) based methods are the prevalent approach. Despite their wide use, two major problems remain to be resolved: over-smoothing and over-fitting. The latter arises naturally when the model structure is too complex for the limited amount of training data. Recently, a new voice conversion method based on Gaussian processes (GPs) was proposed, whose nonparametric nature allows the over-fitting problem to be alleviated significantly. In addition, the GP framework makes it straightforward to perform non-linear mapping by introducing sophisticated kernel functions. This kind of method is therefore explored thoroughly in this paper. To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, making the best use of the correlations between excitation and vocal tract features. Moreover, the accuracy in computing the GP kernel functions can be improved by an asymmetric training strategy that allows the dimensionality of the input vectors to be reasonably higher than that of the output vectors without additional computational cost. Experiments confirm the effectiveness of the proposed method both objectively and subjectively, demonstrating that the GP-based method improves on the traditional GMM-based approach.
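As a rough illustration of why higher-dimensional inputs need not raise the cost of a GP mapping, the sketch below uses standard GP regression with a squared-exponential kernel. It is not the authors' implementation. The function names (`rbf_kernel`, `gp_fit`, `gp_predict`, `stack_context`), the synthetic data, the `length_scale` and `noise` values, and the use of neighboring-frame stacking to build the longer input vectors are all assumptions made for this example; the summary does not say how the paper constructs its asymmetric inputs. The point it shows is that the kernel matrix is N x N for N training frames, whatever the input dimensionality.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_fit(x_train, y_train, length_scale=1.0, noise=1e-2):
    """Return alpha = (K + noise * I)^-1 Y, the weights of the GP posterior mean."""
    k = rbf_kernel(x_train, x_train, length_scale)
    k[np.diag_indices_from(k)] += noise
    return np.linalg.solve(k, y_train)


def gp_predict(x_test, x_train, alpha, length_scale=1.0):
    """GP posterior mean for each row of x_test."""
    return rbf_kernel(x_test, x_train, length_scale) @ alpha


def stack_context(frames, width=1):
    """Hypothetical input construction: append `width` neighbors on each side."""
    padded = np.pad(frames, ((width, width), (0, 0)), mode="edge")
    n = len(frames)
    return np.hstack([padded[i:i + n] for i in range(2 * width + 1)])


# Synthetic stand-ins for aligned source and target feature frames.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 4))
target = np.tanh(source) + 0.5

x_sym = source                  # symmetric: 4-dim input, 4-dim output
x_asym = stack_context(source)  # asymmetric: 12-dim input, 4-dim output

alpha_sym = gp_fit(x_sym, target)
alpha_asym = gp_fit(x_asym, target)

# Convert unseen source frames with the asymmetric model.
new_source = rng.normal(size=(5, 4))
converted = gp_predict(stack_context(new_source), x_asym, alpha_asym)
```

In both settings the linear system solved in `gp_fit` involves a 50 x 50 matrix; the input dimensionality only enters through the pairwise distance computation inside `rbf_kernel`.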
ISSN: 0167-6393
1872-7182
DOI: 10.1016/j.specom.2013.11.005