Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion
Current voice conversion (VC) methods can successfully convert the timbre of audio. However, because modeling the source audio's prosody effectively is a challenging task, there are still limitations in transferring the source style to the converted speech. This study proposes a source style transfer method based on...
Saved in:
Published in: | arXiv.org 2021-06 |
Main authors: | Wang, Zhichao; Zhou, Xinyong; Yang, Fengyu; Li, Tao; Du, Hongqiang; Xie, Lei; Gan, Wendong; Chen, Haitao; Li, Hai |
Format: | Article |
Language: | eng |
Subjects: | Coders; Conversion; Feature extraction; Implicit methods; Linguistics; Speech recognition; Synthesis; Voice recognition |
Online access: | Full text |
container_title | arXiv.org |
creator | Wang, Zhichao; Zhou, Xinyong; Yang, Fengyu; Li, Tao; Du, Hongqiang; Xie, Lei; Gan, Wendong; Chen, Haitao; Li, Hai |
description | Current voice conversion (VC) methods can successfully convert the timbre of audio. However, because modeling the source audio's prosody effectively is a challenging task, there are still limitations in transferring the source style to the converted speech. This study proposes a source style transfer method based on a recognition-synthesis framework. In previous speech generation work, prosody has been modeled either explicitly with prosodic features or implicitly with a latent prosody extractor. In this paper, taking advantage of both, we model prosody in a hybrid manner that effectively combines explicit and implicit methods in a proposed prosody module. Specifically, prosodic features are used to explicitly model prosody, while a VAE and a reference encoder are used to implicitly model prosody, taking the Mel spectrogram and the bottleneck features as input, respectively. Furthermore, adversarial training is introduced to remove speaker-related information from the VAE outputs, preventing source speaker information from leaking while transferring style. Finally, we use a modified self-attention based encoder to extract sentential context from the bottleneck features, which also implicitly aggregates the prosodic aspects of the source speech from the layered representations. Experiments show that our approach is superior to the baseline and a competitive system in terms of style transfer; meanwhile, speech quality and speaker similarity are well maintained. |
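The abstract describes a hybrid prosody module (explicit prosodic features plus a VAE over the Mel spectrogram and a reference encoder over bottleneck features) and an adversarial classifier that strips speaker identity from the VAE latent. Only the abstract is available here, so the following PyTorch-style sketch is purely illustrative: the module names, layer sizes, prosodic-feature dimension, and the use of a gradient-reversal layer for the adversarial speaker classifier are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a hybrid prosody module with adversarial speaker
# removal, loosely following the abstract above. All names and dimensions
# are illustrative assumptions.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.scale * grad_out, None


class HybridProsodyModule(nn.Module):
    def __init__(self, mel_dim=80, bnf_dim=256, prosody_dim=3,
                 latent_dim=16, ref_dim=128, n_speakers=10):
        super().__init__()
        # Implicit path 1: VAE encoder over the Mel spectrogram.
        self.vae_enc = nn.GRU(mel_dim, 128, batch_first=True)
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Implicit path 2: reference encoder over bottleneck features.
        self.ref_enc = nn.GRU(bnf_dim, ref_dim, batch_first=True)
        # Adversarial speaker classifier on the (gradient-reversed) VAE latent.
        self.spk_clf = nn.Linear(latent_dim, n_speakers)
        # Fuse explicit prosodic features (e.g. F0, energy) with both
        # implicit embeddings into a single prosody embedding.
        self.fuse = nn.Linear(latent_dim + ref_dim + prosody_dim, 256)

    def forward(self, mel, bnf, prosody, adv_scale=1.0):
        # mel: (B, T, mel_dim), bnf: (B, T, bnf_dim), prosody: (B, prosody_dim)
        _, h_vae = self.vae_enc(mel)
        mu, logvar = self.to_mu(h_vae[-1]), self.to_logvar(h_vae[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

        _, h_ref = self.ref_enc(bnf)
        ref = h_ref[-1]

        # The classifier sees a gradient-reversed latent, pushing the VAE to
        # discard speaker identity while the classifier tries to recover it.
        spk_logits = self.spk_clf(GradReverse.apply(z, adv_scale))

        prosody_emb = self.fuse(torch.cat([z, ref, prosody], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return prosody_emb, spk_logits, kl


if __name__ == "__main__":
    m = HybridProsodyModule()
    mel = torch.randn(2, 200, 80)    # Mel spectrogram frames
    bnf = torch.randn(2, 200, 256)   # recognizer bottleneck features
    pros = torch.randn(2, 3)         # utterance-level prosodic features
    emb, spk_logits, kl = m(mel, bnf, pros)
    print(emb.shape, spk_logits.shape, kl.item())
```

In training, the speaker classification loss on `spk_logits` would be added to the reconstruction and KL terms; the reversed gradient is what removes speaker-related information from the VAE output, as the abstract describes.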
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2021-06 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2543473143 |
source | Free E-Journals |
subjects | Coders; Conversion; Feature extraction; Implicit methods; Linguistics; Speech recognition; Synthesis; Voice recognition |
title | Enriching Source Style Transfer in Recognition-Synthesis based Non-Parallel Voice Conversion |