Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion


Bibliographic Details
Published in: Digital Signal Processing, 2021-09, Vol. 116, p. 103110, Article 103110
Authors: Kang, Xiao; Huang, Hao; Hu, Ying; Huang, Zhihua
Format: Article
Language: English
Description
Highlights:
• CTC loss is used to guide the VQ-VAE to learn pure content representations.
• Experiments show generated speech with better naturalness and similarity.
• Thorough analysis provides useful insight into representation disentangling.

Abstract: The vector quantized variational autoencoder (VQ-VAE) has recently become an increasingly popular method for non-parallel zero-shot voice conversion (VC). The reason is that VQ-VAE can disentangle content and speaker representations from speech by using a content encoder and a speaker encoder, which suits the VC task of making the speech of a source speaker sound like that of a target speaker without changing the linguistic content. However, the converted speech is not satisfactory, because it is difficult to disentangle pure content representations from the acoustic features given the lack of linguistic supervision for the content encoder. To address this issue, a connectionist temporal classification (CTC) loss is proposed, under the VQ-VAE framework, to guide the content encoder toward pure content representations through an auxiliary network. Because the CTC loss is unaffected by the sequence length of the content encoder's output, adding this linguistic supervision to the content encoder becomes much easier. The resulting non-parallel many-to-many voice conversion model is named CTC-VQ-VAE. VC experiments on the CMU ARCTIC and VCTK corpora are carried out to evaluate the proposed method. Both objective and subjective results show that the proposed approach significantly improves the speech quality and speaker similarity of the converted speech compared with the traditional VQ-VAE method.
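
The sketch below illustrates the key idea in the abstract: attaching an auxiliary CTC head to the content encoder's (quantized) output and adding the CTC term to the usual VQ-VAE objective. It is a minimal PyTorch-style sketch, not the authors' implementation; the names (CTCAuxHead, lambda_ctc, beta) and the exact form of the reconstruction and commitment terms are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCAuxHead(nn.Module):
    """Auxiliary network mapping content codes to phoneme logits (hypothetical)."""
    def __init__(self, code_dim: int, num_phonemes: int):
        super().__init__()
        # One extra output class for the CTC blank symbol (index 0 by convention).
        self.proj = nn.Linear(code_dim, num_phonemes + 1)

    def forward(self, content_codes):            # (B, T, code_dim)
        return self.proj(content_codes)          # (B, T, num_phonemes + 1)


def ctc_vq_vae_loss(recon_mel, target_mel, encoder_out, content_codes,
                    aux_head, phonemes, phoneme_lens,
                    lambda_ctc=1.0, beta=0.25):
    # Standard VQ-VAE terms: reconstruction + commitment
    # (codebook update / straight-through details omitted for brevity).
    recon_loss = F.l1_loss(recon_mel, target_mel)
    commit_loss = F.mse_loss(encoder_out, content_codes.detach())

    # Auxiliary CTC term on the content codes. CTC tolerates the length
    # mismatch between the frame-level code sequence (length T) and the
    # phoneme transcript, so no frame-level alignment is needed.
    logits = aux_head(content_codes)                         # (B, T, C)
    log_probs = logits.log_softmax(dim=-1).transpose(0, 1)   # (T, B, C) as nn.CTCLoss expects
    input_lens = torch.full((logits.size(0),), logits.size(1), dtype=torch.long)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)(log_probs, phonemes,
                                                  input_lens, phoneme_lens)

    return recon_loss + beta * commit_loss + lambda_ctc * ctc
```

Under this reading, the speaker encoder and decoder are untouched; only the content branch receives the extra supervision, which is what pushes speaker-dependent information out of the content codes.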
ISSN: 1051-2004, 1095-4333
DOI: 10.1016/j.dsp.2021.103110