HMM-based sequence-to-frame mapping for voice conversion


Bibliographic details
Main authors: Yu Qiao; Saito, Daisuke; Minematsu, N
Format: Conference proceedings
Language: English
container_end_page 4833
container_start_page 4830
creator Yu Qiao
Saito, Daisuke
Minematsu, N
description Voice conversion can be reduced to the problem of finding a transformation function between the corresponding speech sequences of two speakers. Perhaps the most popular voice conversion methods are GMM-based statistical mapping methods. However, the classical GMM-based mapping is frame-to-frame and cannot take into account the contextual information present across a speech sequence. It is well known that the HMM provides an efficient way to model the density of a whole speech sequence and has seen great success in speech recognition and synthesis. Inspired by this fact, this paper studies how to use HMMs for voice conversion. We derive an HMM-based sequence-to-frame mapping function through statistical analysis. Unlike previous HMM-based voice conversion methods, which used forced alignment for segmentation and transformed the frames aligned to each state with that state's associated linear transformation, our method uses a soft mapping function: a weighted sum of linear transformations whose weights are the HMM posterior probabilities of the frames. We also propose and compare two methods to learn the parameters of our mapping functions, namely least-squares error estimation and maximum-likelihood estimation. We carried out experiments to examine the proposed HMM-based method for voice conversion.
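The soft mapping the abstract describes — each converted frame produced as a posterior-weighted sum of per-state linear transformations — can be sketched as below. This is a minimal illustration, not the paper's implementation: the dimensions, variable names, and the toy softmax "posteriors" are all assumptions; in the paper the weights would come from HMM forward-backward posteriors, and the transforms would be learned by least squares or maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 3   # feature dimension per frame (e.g. cepstral order) -- assumed
S = 4   # number of HMM states -- assumed
T = 5   # number of frames in the source sequence

X = rng.normal(size=(T, D))     # source-speaker feature frames
A = rng.normal(size=(S, D, D))  # one linear transform per HMM state
b = rng.normal(size=(S, D))     # one bias per HMM state

# gamma[t, s]: posterior probability of state s at frame t.
# Toy stand-in here (row-wise softmax); the paper computes these
# from the HMM via forward-backward.
logits = rng.normal(size=(T, S))
gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Soft sequence-to-frame mapping: y_t = sum_s gamma[t, s] * (A_s x_t + b_s)
per_state = np.einsum('sij,tj->tsi', A, X) + b[None, :, :]  # (T, S, D)
Y = np.einsum('ts,tsi->ti', gamma, per_state)               # (T, D)
```

Note how a hard (forced-alignment) mapping is the special case where each row of `gamma` is a one-hot vector, so exactly one state's transform is applied to each frame.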
doi_str_mv 10.1109/ICASSP.2010.5495141
format Conference Proceeding
identifier ISSN: 1520-6149; EISSN: 2379-190X; ISBN: 9781424442959 (1424442958); EISBN: 9781424442966 (1424442966)
ispartof 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, p.4830-4833
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Cepstral analysis
Hidden Markov models
HMM
Least squares approximation
Maximum likelihood estimation
Probability
sequence-to-frame mapping
Speech recognition
Speech synthesis
Statistical analysis
Vectors
Virtual colonoscopy
Voice conversion