Continuous vocoder applied in deep neural network based voice conversion
In this paper, a novel vocoder is proposed for a Statistical Voice Conversion (SVC) framework using deep neural network, where multiple features from the speech of two speakers (source and target) are converted acoustically. Traditional conversion methods focus on the prosodic feature represented by...
Gespeichert in:
Veröffentlicht in: | Multimedia tools and applications 2019-12, Vol.78 (23), p.33549-33572 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | In this paper, a novel vocoder is proposed for a Statistical Voice Conversion (SVC) framework using deep neural network, where multiple features from the speech of two speakers (source and target) are converted acoustically. Traditional conversion methods focus on the prosodic feature represented by the discontinuous fundamental frequency (F0) and the spectral envelope. Studies have shown that speech analysis/synthesis solutions play an important role in the overall quality of the converted voice. Recently, we have proposed a new continuous vocoder, originally for statistical parametric speech synthesis, in which all parameters are continuous. Therefore, this work introduces a new method by using a continuous F0 (contF0) in SVC to avoid alignment errors that may happen in voiced and unvoiced segments and can degrade the converted speech. Our contribution includes the following. (1) We integrate into the SVC framework the continuous vocoder, which provides an advanced model of the excitation signal, by converting its contF0, maximum voiced frequency, and spectral features. (2) We show that the feed-forward deep neural network (FF-DNN) using our vocoder yields high quality conversion. (3) We apply a geometric approach to spectral subtraction (GA-SS) in the final stage of the proposed framework, to improve the signal-to-noise ratio of the converted speech. Our experimental results, using two male and one female speakers, have shown that the resulting converted speech with the proposed SVC technique is similar to the target speaker and gives state-of-the-art performance as measured by objective evaluation and subjective listening tests. |
---|---|
ISSN: | 1380-7501 1573-7721 |
DOI: | 10.1007/s11042-019-08198-5 |