HMM-based sequence-to-frame mapping for voice conversion
Voice conversion can be reduced to the problem of finding a transformation function between the corresponding speech sequences of two speakers. Perhaps the most popular voice conversion methods are GMM-based statistical mapping methods. However, the classical GMM-based mapping is frame-to-frame and cannot take...
Saved in:
Main authors: | Yu Qiao; Saito, Daisuke; Minematsu, N |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online access: | Order full text |
container_end_page | 4833 |
---|---|
container_issue | |
container_start_page | 4830 |
container_title | |
container_volume | |
creator | Yu Qiao Saito, Daisuke Minematsu, N |
description | Voice conversion can be reduced to the problem of finding a transformation function between the corresponding speech sequences of two speakers. Perhaps the most popular voice conversion methods are GMM-based statistical mapping methods. However, the classical GMM-based mapping is frame-to-frame and cannot take account of the contextual information present over a speech sequence. It is well known that HMMs provide an efficient way to model the density of a whole speech sequence and have achieved great success in speech recognition and synthesis. Inspired by this fact, this paper studies how to use HMMs for voice conversion. We derive an HMM-based sequence-to-frame mapping function through statistical analysis. Unlike previous HMM-based voice conversion methods, which used forced alignment for segmentation and transformed the frames aligned to a state with that state's associated linear transformation, our method has a soft mapping function: a weighted summation of linear transformations, where the weights are the HMM posterior probabilities of the frames. We also propose and compare two methods to learn the parameters of our mapping function, namely least-square-error estimation and maximum-likelihood estimation. We carried out experiments to examine the proposed HMM-based method for voice conversion. |
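The soft mapping described in the abstract can be sketched as follows. This is an illustrative reconstruction of the general idea (a posterior-weighted sum of per-state linear transforms, y_t = Σ_s γ_t(s)·(A_s x_t + b_s)), not the authors' implementation; the function name `soft_map`, the toy transforms, and the posterior values are all assumptions.

```python
# Sketch of an HMM-based soft sequence-to-frame mapping: each source frame
# is converted by a weighted sum of per-state linear transforms, weighted
# by the HMM state posteriors gamma_t(s) (e.g. from forward-backward).
# All names and numbers here are illustrative assumptions.

def soft_map(frame, posteriors, transforms):
    """Convert one source frame x_t.

    frame:      source feature vector x_t (list of floats).
    posteriors: gamma_t(s) for each state s at time t (should sum to 1).
    transforms: per-state pairs (A_s, b_s), a matrix and an offset vector.
    Returns y_t = sum_s gamma_t(s) * (A_s @ x_t + b_s).
    """
    dim = len(frame)
    y = [0.0] * dim
    for gamma, (A, b) in zip(posteriors, transforms):
        for i in range(dim):
            acc = b[i]                      # b_s[i]
            for j in range(dim):
                acc += A[i][j] * frame[j]   # (A_s x_t)[i]
            y[i] += gamma * acc             # weight by gamma_t(s)
    return y

# Toy setup: two HMM states, 2-dimensional features.
transforms = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.0]),  # state 0: identity plus offset
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),  # state 1: uniform scaling
]
frame = [1.0, 1.0]
posteriors = [0.25, 0.75]                    # gamma_t(s), summing to 1
y = soft_map(frame, posteriors, transforms)
# y = 0.25*[1.5, 1.0] + 0.75*[2.0, 2.0] = [1.875, 1.75]
```

Because the weights vary smoothly with the posteriors, the conversion changes gradually across state boundaries instead of switching transforms abruptly, which is the contrast the abstract draws with hard forced-alignment methods.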
doi_str_mv | 10.1109/ICASSP.2010.5495141 |
format | Conference Proceeding |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-6149; EISSN: 2379-190X; ISBN: 9781424442959; ISBN: 1424442958; EISBN: 9781424442966; EISBN: 1424442966 |
ispartof | 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, p.4830-4833 |
issn | 1520-6149 2379-190X |
language | eng |
recordid | cdi_ieee_primary_5495141 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Cepstral analysis; Hidden Markov models; HMM; Least squares approximation; Maximum likelihood estimation; Probability; sequence-to-frame mapping; Speech recognition; Speech synthesis; Statistical analysis; Vectors; Virtual colonoscopy; Voice conversion |
title | HMM-based sequence-to-frame mapping for voice conversion |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T09%3A26%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=HMM-based%20sequence-to-frame%20mapping%20for%20voice%20conversion&rft.btitle=2010%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing&rft.au=Yu%20Qiao&rft.date=2010-01-01&rft.spage=4830&rft.epage=4833&rft.pages=4830-4833&rft.issn=1520-6149&rft.eissn=2379-190X&rft.isbn=9781424442959&rft.isbn_list=1424442958&rft_id=info:doi/10.1109/ICASSP.2010.5495141&rft_dat=%3Cieee_6IE%3E5495141%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=9781424442966&rft.eisbn_list=1424442966&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=5495141&rfr_iscdi=true |