HMM-based sequence-to-frame mapping for voice conversion


Bibliographic details
Main authors: Yu Qiao; Saito, Daisuke; Minematsu, N
Format: Conference proceedings
Language: English
container_end_page 4833
container_start_page 4830
creator Yu Qiao
Saito, Daisuke
Minematsu, N
description Voice conversion can be reduced to the problem of finding a transformation function between the corresponding speech sequences of two speakers. Perhaps the most popular voice conversion methods are GMM-based statistical mapping methods. However, the classical GMM-based mapping is frame-to-frame and cannot take into account the contextual information present across a speech sequence. It is well known that the HMM provides an efficient way to model the density of a whole speech sequence and has seen great success in speech recognition and synthesis. Inspired by this fact, this paper studies how to use HMMs for voice conversion. We derive an HMM-based sequence-to-frame mapping function through statistical analysis. Unlike previous HMM-based voice conversion methods, which used forced alignment for segmentation and transformed the frames aligned to each state with that state's associated linear transformation, our method uses a soft mapping function: a weighted sum of linear transformations whose weights are the HMM posterior probabilities of the frames. We also propose and compare two methods to learn the parameters of our mapping functions, namely least-squares error estimation and maximum-likelihood estimation. We carried out experiments to examine the proposed HMM-based method for voice conversion.
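The soft mapping the abstract describes — each converted frame produced as a posterior-weighted sum of per-state linear transformations — can be sketched as below. This is a minimal illustration, not the paper's implementation: the dimensions, variable names, and the toy softmax "posteriors" are all assumptions; in the paper the weights would come from HMM forward-backward posteriors, and the transforms would be learned by least squares or maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 3   # feature dimension per frame (e.g. cepstral order) -- assumed
S = 4   # number of HMM states -- assumed
T = 5   # number of frames in the source sequence

X = rng.normal(size=(T, D))     # source-speaker feature frames
A = rng.normal(size=(S, D, D))  # one linear transform per HMM state
b = rng.normal(size=(S, D))     # one bias per HMM state

# gamma[t, s]: posterior probability of state s at frame t.
# Toy stand-in here (row-wise softmax); the paper computes these
# from the HMM via forward-backward.
logits = rng.normal(size=(T, S))
gamma = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Soft sequence-to-frame mapping: y_t = sum_s gamma[t, s] * (A_s x_t + b_s)
per_state = np.einsum('sij,tj->tsi', A, X) + b[None, :, :]  # (T, S, D)
Y = np.einsum('ts,tsi->ti', gamma, per_state)               # (T, D)
```

Note how a hard (forced-alignment) mapping is the special case where each row of `gamma` is a one-hot vector, so exactly one state's transform is applied to each frame.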
doi_str_mv 10.1109/ICASSP.2010.5495141
format Conference Proceeding
identifier ISSN: 1520-6149; EISSN: 2379-190X; ISBN: 9781424442959 (1424442958); EISBN: 9781424442966 (1424442966)
ispartof 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, p.4830-4833
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Cepstral analysis
Hidden Markov models
HMM
Least squares approximation
Maximum likelihood estimation
Probability
sequence-to-frame mapping
Speech recognition
Speech synthesis
Statistical analysis
Vectors
Virtual colonoscopy
Voice conversion