Learning Structured Sparse Representations for Voice Conversion

Sparse-coding techniques for voice conversion assume that an utterance can be decomposed into a sparse code that only carries linguistic contents, and a dictionary of atoms that captures the speakers' characteristics. However, conventional dictionary-construction and sparse-coding algorithms rarely meet this assumption. The result is that the sparse code is no longer speaker-independent, which leads to lower voice-conversion performance. In this paper, we propose a Cluster-Structured Sparse Representation (CSSR) that improves the speaker independence of the representations. CSSR consists of two complementary components: a Cluster-Structured Dictionary Learning module that groups atoms in the dictionary into clusters, and a Cluster-Selective Objective Function that encourages each speech frame to be represented by atoms from a small number of clusters. We conducted four experiments on the CMU ARCTIC corpus to evaluate the proposed method. In a first ablation study, results show that each of the two CSSR components enhances speaker independence, and that combining both components leads to further improvements. In a second experiment, we find that CSSR uses increasingly larger dictionaries more efficiently than phoneme-based representations by allowing finer-grained decompositions of speech sounds. In a third experiment, results from objective and subjective measurements show that CSSR outperforms prior voice-conversion methods, improving the acoustic quality of the synthesized speech while retaining the target speaker's voice identity. Finally, we show that the CSSR captures latent (i.e., phonetic) information in the speech signal.
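The cluster-selective coding the abstract describes can be illustrated with a short sketch. The following is a minimal illustration under assumptions of ours, not the authors' implementation: it greedily picks the clusters whose atoms best match the current residual and then re-fits a least-squares code within them. The function name, the greedy selection rule, and the `max_clusters` parameter are hypothetical stand-ins for the paper's Cluster-Selective Objective Function.

```python
import numpy as np

def cluster_selective_encode(x, D, cluster_ids, max_clusters=2):
    """Represent frame x using atoms from at most `max_clusters`
    clusters of dictionary D (one atom per column).

    Greedy stand-in for a cluster-selective objective: at each step,
    pick the cluster whose atoms best explain the current residual,
    then re-fit the code over all atoms in the selected clusters.
    """
    n_atoms = D.shape[1]
    code = np.zeros(n_atoms)
    selected = []
    residual = x.copy()
    for _ in range(max_clusters):
        # Score each remaining cluster by how strongly its atoms
        # correlate with the residual.
        best, best_score = None, -np.inf
        for c in np.unique(cluster_ids):
            if c in selected:
                continue
            score = np.linalg.norm(D[:, cluster_ids == c].T @ residual)
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
        # Least-squares fit restricted to atoms of the selected clusters.
        mask = np.isin(cluster_ids, selected)
        coeffs, *_ = np.linalg.lstsq(D[:, mask], x, rcond=None)
        code[:] = 0.0
        code[mask] = coeffs
        residual = x - D @ code
    return code

# Toy usage: 20-dim frames, 64 atoms grouped into 8 clusters.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 64))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
cluster_ids = np.repeat(np.arange(8), 8)  # 8 atoms per cluster
x = rng.standard_normal(20)
code = cluster_selective_encode(x, D, cluster_ids)
print(np.unique(cluster_ids[code != 0]))  # at most 2 clusters active
```

Constraining each frame to a handful of clusters, rather than to arbitrary atoms, is what the abstract credits for improving speaker independence: frames of the same speech sound tend to draw on the same small set of clusters regardless of speaker.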

Saved in:
Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, Vol. 28, p. 343-354
Main Authors: Ding, Shaojin; Zhao, Guanlong; Liberatore, Christopher; Gutierrez-Osuna, Ricardo
Format: Article
Language: English
Subjects:
Online Access: Order full text
container_end_page 354
container_issue
container_start_page 343
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 28
creator Ding, Shaojin
Zhao, Guanlong
Liberatore, Christopher
Gutierrez-Osuna, Ricardo
doi_str_mv 10.1109/TASLP.2019.2955289
format Article
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2020, Vol.28, p.343-354
issn 2329-9290
2329-9304
language eng
recordid cdi_ieee_primary_8910392
source IEEE Electronic Library (IEL)
subjects Ablation
Acoustics
Algorithms
Atomic properties
Clustering algorithms
Clusters
Coding
Conversion
Decomposition
Dictionaries
dictionary learning
Encoding
Learning
Machine learning
Phonetics
Representations
sparse coding
sparse representation
Speech
Speech processing
Speech sounds
Training
Voice
Voice conversion
title Learning Structured Sparse Representations for Voice Conversion