Exemplar-based voice conversion using joint nonnegative matrix factorization
Exemplar-based sparse representation is a nonparametric framework for voice conversion. In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictio...
Published in: | Multimedia tools and applications 2015-11, Vol.74 (22), p.9943-9958 |
---|---|
Main authors: | Wu, Zhizheng ; Chng, Eng Siong ; Li, Haizhou |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 9958 |
---|---|
container_issue | 22 |
container_start_page | 9943 |
container_title | Multimedia tools and applications |
container_volume | 74 |
creator | Wu, Zhizheng ; Chng, Eng Siong ; Li, Haizhou |
description | Exemplar-based sparse representation is a nonparametric framework for voice conversion. In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictionaries consisting of acoustically aligned source-target exemplars, and assumes they can share the same activation matrix. At runtime, a source spectrogram is factorized as a product of the source dictionary and the common activation matrix, which is applied to the target dictionary to generate the target spectrogram. In practice, either low-resolution mel-scale filter bank energies or high-resolution spectra are adopted in the source dictionary. Low-resolution features are flexible in capturing the temporal information without increasing the computational cost and the memory occupation significantly, while high-resolution spectra contain significant spectral details. In this paper, we propose a joint nonnegative matrix factorization technique to find the common activation matrix using low- and high-resolution features at the same time. In this way, the common activation matrix is able to benefit from low- and high-resolution features directly. We conducted experiments on the VOICES database to evaluate the performance of the proposed method. Both objective and subjective evaluations confirmed the effectiveness of the proposed methods. |
doi_str_mv | 10.1007/s11042-014-2180-2 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_miscellaneous_1770372903</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3939972491</sourcerecordid><originalsourceid>FETCH-LOGICAL-c485t-d8cc58610c5150af5d76ec16ccbc6d397bc9bc6d604efd10a2f0f3d52c3f8f1c3</originalsourceid><addsrcrecordid>eNp1kE9LAzEUxBdRsFY_gLcFL15W30s2yfYopf6Bghc9hzSblJRtUpPdUv30pqwHETy9YfjN8JiiuEa4QwBxnxChJhVgXRFsoCInxQSZoJUQBE-zptkUDPC8uEhpA4CckXpSLBcHs911KlYrlUxb7oPTptTB701MLvhySM6vy01wvi998N6sVe_2ptyqPrpDaZXuQ3Rf2Qz-sjizqkvm6udOi_fHxdv8uVq-Pr3MH5aVrhvWV22jNWs4gmbIQFnWCm40cq1Xmrd0JlZ6dlQcamNbBEUsWNoyoqltLGo6LW7H3l0MH4NJvdy6pE3XKW_CkCQKAVSQGdCM3vxBN2GIPn-XKY4NrQXnmcKR0jGkFI2Vu-i2Kn5KBHncV477yryvPO4rSc6QMZMy69cm_mr-N_QNQ5p-kg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1761834766</pqid></control><display><type>article</type><title>Exemplar-based voice conversion using joint nonnegative matrix factorization</title><source>SpringerNature Journals</source><creator>Wu, Zhizheng ; Chng, Eng Siong ; Li, Haizhou</creator><creatorcontrib>Wu, Zhizheng ; Chng, Eng Siong ; Li, Haizhou</creatorcontrib><description>Exemplar-based sparse representation is a nonparametric framework for voice conversion. In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictionaries consisting of acoustically aligned source-target exemplars, and assumes they can share the same activation matrix. At runtime, a source spectrogram is factorized as a product of the source dictionary and the common activation matrix, which is applied to the target dictionary to generate the target spectrogram. In practice, either low-resolution mel-scale filter bank energies or high-resolution spectra are adopted in the source dictionary. 
Low-resolution features are flexible in capturing the temporal information without increasing the computational cost and the memory occupation significantly, while high-resolution spectra contain significant spectral details. In this paper, we propose a joint nonnegative matrix factorization technique to find the common activation matrix using low- and high-resolution features at the same time. In this way, the common activation matrix is able to benefit from low- and high-resolution features directly. We conducted experiments on the VOICES database to evaluate the performance of the proposed method. Both objective and subjective evaluations confirmed the effectiveness of the proposed methods.</description><identifier>ISSN: 1380-7501</identifier><identifier>EISSN: 1573-7721</identifier><identifier>DOI: 10.1007/s11042-014-2180-2</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Activation ; Computer Communication Networks ; Computer engineering ; Computer Science ; Conversion ; Data Structures and Information Theory ; Dictionaries ; Factorization ; Matrix ; Multimedia communications ; Multimedia computer applications ; Multimedia Information Systems ; Optimization techniques ; Performance evaluation ; Sparsity ; Speaking ; Special Purpose and Application-Based Systems ; Spectra ; Spectrograms ; Speech ; Statistical analysis ; Voice ; Voice simulation</subject><ispartof>Multimedia tools and applications, 2015-11, Vol.74 (22), p.9943-9958</ispartof><rights>Springer Science+Business Media New York 2014</rights><rights>Springer Science+Business Media New York 
2015</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c485t-d8cc58610c5150af5d76ec16ccbc6d397bc9bc6d604efd10a2f0f3d52c3f8f1c3</citedby><cites>FETCH-LOGICAL-c485t-d8cc58610c5150af5d76ec16ccbc6d397bc9bc6d604efd10a2f0f3d52c3f8f1c3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s11042-014-2180-2$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s11042-014-2180-2$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,780,784,27924,27925,41488,42557,51319</link.rule.ids></links><search><creatorcontrib>Wu, Zhizheng</creatorcontrib><creatorcontrib>Chng, Eng Siong</creatorcontrib><creatorcontrib>Li, Haizhou</creatorcontrib><title>Exemplar-based voice conversion using joint nonnegative matrix factorization</title><title>Multimedia tools and applications</title><addtitle>Multimed Tools Appl</addtitle><description>Exemplar-based sparse representation is a nonparametric framework for voice conversion. In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictionaries consisting of acoustically aligned source-target exemplars, and assumes they can share the same activation matrix. At runtime, a source spectrogram is factorized as a product of the source dictionary and the common activation matrix, which is applied to the target dictionary to generate the target spectrogram. In practice, either low-resolution mel-scale filter bank energies or high-resolution spectra are adopted in the source dictionary. 
Low-resolution features are flexible in capturing the temporal information without increasing the computational cost and the memory occupation significantly, while high-resolution spectra contain significant spectral details. In this paper, we propose a joint nonnegative matrix factorization technique to find the common activation matrix using low- and high-resolution features at the same time. In this way, the common activation matrix is able to benefit from low- and high-resolution features directly. We conducted experiments on the VOICES database to evaluate the performance of the proposed method. Both objective and subjective evaluations confirmed the effectiveness of the proposed methods.</description><subject>Activation</subject><subject>Computer Communication Networks</subject><subject>Computer engineering</subject><subject>Computer Science</subject><subject>Conversion</subject><subject>Data Structures and Information Theory</subject><subject>Dictionaries</subject><subject>Factorization</subject><subject>Matrix</subject><subject>Multimedia communications</subject><subject>Multimedia computer applications</subject><subject>Multimedia Information Systems</subject><subject>Optimization techniques</subject><subject>Performance evaluation</subject><subject>Sparsity</subject><subject>Speaking</subject><subject>Special Purpose and Application-Based Systems</subject><subject>Spectra</subject><subject>Spectrograms</subject><subject>Speech</subject><subject>Statistical analysis</subject><subject>Voice</subject><subject>Voice 
simulation</subject><issn>1380-7501</issn><issn>1573-7721</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2015</creationdate><recordtype>article</recordtype><sourceid>8G5</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><sourceid>GUQSH</sourceid><sourceid>M2O</sourceid><recordid>eNp1kE9LAzEUxBdRsFY_gLcFL15W30s2yfYopf6Bghc9hzSblJRtUpPdUv30pqwHETy9YfjN8JiiuEa4QwBxnxChJhVgXRFsoCInxQSZoJUQBE-zptkUDPC8uEhpA4CckXpSLBcHs911KlYrlUxb7oPTptTB701MLvhySM6vy01wvi998N6sVe_2ptyqPrpDaZXuQ3Rf2Qz-sjizqkvm6udOi_fHxdv8uVq-Pr3MH5aVrhvWV22jNWs4gmbIQFnWCm40cq1Xmrd0JlZ6dlQcamNbBEUsWNoyoqltLGo6LW7H3l0MH4NJvdy6pE3XKW_CkCQKAVSQGdCM3vxBN2GIPn-XKY4NrQXnmcKR0jGkFI2Vu-i2Kn5KBHncV477yryvPO4rSc6QMZMy69cm_mr-N_QNQ5p-kg</recordid><startdate>20151101</startdate><enddate>20151101</enddate><creator>Wu, Zhizheng</creator><creator>Chng, Eng Siong</creator><creator>Li, Haizhou</creator><general>Springer US</general><general>Springer Nature 
B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7SC</scope><scope>7WY</scope><scope>7WZ</scope><scope>7XB</scope><scope>87Z</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>8FL</scope><scope>8G5</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BEZIV</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FRNLG</scope><scope>F~G</scope><scope>GNUQQ</scope><scope>GUQSH</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K60</scope><scope>K6~</scope><scope>K7-</scope><scope>L.-</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M0C</scope><scope>M0N</scope><scope>M2O</scope><scope>MBDVC</scope><scope>P5Z</scope><scope>P62</scope><scope>PQBIZ</scope><scope>PQBZA</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope></search><sort><creationdate>20151101</creationdate><title>Exemplar-based voice conversion using joint nonnegative matrix factorization</title><author>Wu, Zhizheng ; Chng, Eng Siong ; Li, Haizhou</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c485t-d8cc58610c5150af5d76ec16ccbc6d397bc9bc6d604efd10a2f0f3d52c3f8f1c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2015</creationdate><topic>Activation</topic><topic>Computer Communication Networks</topic><topic>Computer engineering</topic><topic>Computer Science</topic><topic>Conversion</topic><topic>Data Structures and Information Theory</topic><topic>Dictionaries</topic><topic>Factorization</topic><topic>Matrix</topic><topic>Multimedia communications</topic><topic>Multimedia computer applications</topic><topic>Multimedia Information Systems</topic><topic>Optimization techniques</topic><topic>Performance evaluation</topic><topic>Sparsity</topic><topic>Speaking</topic><topic>Special Purpose 
and Application-Based Systems</topic><topic>Spectra</topic><topic>Spectrograms</topic><topic>Speech</topic><topic>Statistical analysis</topic><topic>Voice</topic><topic>Voice simulation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wu, Zhizheng</creatorcontrib><creatorcontrib>Chng, Eng Siong</creatorcontrib><creatorcontrib>Li, Haizhou</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Computer and Information Systems Abstracts</collection><collection>Access via ABI/INFORM (ProQuest)</collection><collection>ABI/INFORM Global (PDF only)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>ABI/INFORM Global (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ABI/INFORM Collection (Alumni Edition)</collection><collection>Research Library (Alumni Edition)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Business Premium Collection</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Business Premium Collection (Alumni)</collection><collection>ABI/INFORM Global (Corporate)</collection><collection>ProQuest Central Student</collection><collection>Research Library Prep</collection><collection>SciTech Premium 
Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Business Collection (Alumni Edition)</collection><collection>ProQuest Business Collection</collection><collection>Computer Science Database</collection><collection>ABI/INFORM Professional Advanced</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>ABI/INFORM Global</collection><collection>Computing Database</collection><collection>Research Library</collection><collection>Research Library (Corporate)</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Business</collection><collection>ProQuest One Business (Alumni)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><jtitle>Multimedia tools and applications</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wu, Zhizheng</au><au>Chng, Eng Siong</au><au>Li, Haizhou</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Exemplar-based voice conversion using joint nonnegative matrix factorization</atitle><jtitle>Multimedia tools and applications</jtitle><stitle>Multimed Tools Appl</stitle><date>2015-11-01</date><risdate>2015</risdate><volume>74</volume><issue>22</issue><spage>9943</spage><epage>9958</epage><pages>9943-9958</pages><issn>1380-7501</issn><eissn>1573-7721</eissn><abstract>Exemplar-based sparse representation is a nonparametric framework for voice conversion. 
In this framework, a target spectrum is generated as a weighted linear combination of a set of basis spectra, namely exemplars, extracted from the training data. This framework adopts coupled source-target dictionaries consisting of acoustically aligned source-target exemplars, and assumes they can share the same activation matrix. At runtime, a source spectrogram is factorized as a product of the source dictionary and the common activation matrix, which is applied to the target dictionary to generate the target spectrogram. In practice, either low-resolution mel-scale filter bank energies or high-resolution spectra are adopted in the source dictionary. Low-resolution features are flexible in capturing the temporal information without increasing the computational cost and the memory occupation significantly, while high-resolution spectra contain significant spectral details. In this paper, we propose a joint nonnegative matrix factorization technique to find the common activation matrix using low- and high-resolution features at the same time. In this way, the common activation matrix is able to benefit from low- and high-resolution features directly. We conducted experiments on the VOICES database to evaluate the performance of the proposed method. Both objective and subjective evaluations confirmed the effectiveness of the proposed methods.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s11042-014-2180-2</doi><tpages>16</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1380-7501 |
ispartof | Multimedia tools and applications, 2015-11, Vol.74 (22), p.9943-9958 |
issn | 1380-7501 ; 1573-7721 |
language | eng |
recordid | cdi_proquest_miscellaneous_1770372903 |
source | SpringerNature Journals |
subjects | Activation ; Computer Communication Networks ; Computer engineering ; Computer Science ; Conversion ; Data Structures and Information Theory ; Dictionaries ; Factorization ; Matrix ; Multimedia communications ; Multimedia computer applications ; Multimedia Information Systems ; Optimization techniques ; Performance evaluation ; Sparsity ; Speaking ; Special Purpose and Application-Based Systems ; Spectra ; Spectrograms ; Speech ; Statistical analysis ; Voice ; Voice simulation |
title | Exemplar-based voice conversion using joint nonnegative matrix factorization |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T03%3A13%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Exemplar-based%20voice%20conversion%20using%20joint%20nonnegative%20matrix%20factorization&rft.jtitle=Multimedia%20tools%20and%20applications&rft.au=Wu,%20Zhizheng&rft.date=2015-11-01&rft.volume=74&rft.issue=22&rft.spage=9943&rft.epage=9958&rft.pages=9943-9958&rft.issn=1380-7501&rft.eissn=1573-7721&rft_id=info:doi/10.1007/s11042-014-2180-2&rft_dat=%3Cproquest_cross%3E3939972491%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1761834766&rft_id=info:pmid/&rfr_iscdi=true |
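
The description field above outlines the runtime procedure: estimate one activation matrix that explains both the low- and the high-resolution source features, then apply it to the target dictionary. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a weighted Kullback-Leibler objective with an L1 sparsity penalty and the standard multiplicative update for fixed dictionaries; the record does not state the paper's exact cost function. All names (`joint_activations`, `convert`, `alpha`, `sparsity`) are hypothetical.

```python
import numpy as np


def joint_activations(X_low, X_high, A_low, A_high,
                      alpha=1.0, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate one activation matrix H shared by two source dictionaries.

    Assumed objective (an illustration, not taken from the paper):
        KL(X_low || A_low H) + alpha * KL(X_high || A_high H) + sparsity * sum(H)
    X_low:  (d_low,  n_frames)    low-resolution source features
    X_high: (d_high, n_frames)    high-resolution source spectra
    A_low:  (d_low,  n_exemplars) low-resolution source dictionary
    A_high: (d_high, n_exemplars) high-resolution source dictionary
    """
    n_exemplars = A_low.shape[1]
    n_frames = X_low.shape[1]
    H = np.ones((n_exemplars, n_frames))

    # The denominator of the multiplicative update does not depend on H.
    denom = (A_low.T @ np.ones_like(X_low)
             + alpha * (A_high.T @ np.ones_like(X_high))
             + sparsity)

    for _ in range(n_iter):
        ratio_low = X_low / (A_low @ H + eps)
        ratio_high = X_high / (A_high @ H + eps)
        numer = A_low.T @ ratio_low + alpha * (A_high.T @ ratio_high)
        H *= numer / (denom + eps)  # keeps H nonnegative
    return H


def convert(X_low, X_high, A_low, A_high, B_target, **kwargs):
    """Apply the shared activations to the coupled target dictionary."""
    H = joint_activations(X_low, X_high, A_low, A_high, **kwargs)
    return B_target @ H
```

Here `alpha` trades off the two resolutions: `alpha=0` reduces the update to a factorization on the low-resolution features alone, while a large `alpha` lets the high-resolution spectra dominate. The columns of `A_low`, `A_high`, and `B_target` are assumed to be aligned exemplar by exemplar, as the coupled-dictionary framework requires.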