Polyglot Speech Synthesis Based on Cross-Lingual Frame Selection Using Auditory and Articulatory Features

In this paper, an approach for polyglot speech synthesis based on cross-lingual frame selection is proposed. This method requires only mono-lingual speech data of different speakers in different languages for building a polyglot synthesis system, thus reducing the burden of data collection. Essentially, a set of artificial utterances in the second language for a target speaker is constructed based on the proposed cross-lingual frame-selection process, and this data set is used to adapt a synthesis model in the second language to the speaker. In the cross-lingual frame-selection process, we propose to use auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system is implemented where the target speaker only speaks Mandarin. The results show that decent performance regarding voice identity and speech quality can be achieved with the proposed method.

Detailed Description

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014-10, Vol. 22 (10), p. 1558-1570
Main authors: Chen, Chia-Ping, Huang, Yi-Chin, Wu, Chung-Hsien, Lee, Kuan-De
Format: Article
Language: English
Subjects:
Online access: Order full text
container_end_page 1570
container_issue 10
container_start_page 1558
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 22
creator Chen, Chia-Ping ; Huang, Yi-Chin ; Wu, Chung-Hsien ; Lee, Kuan-De
description In this paper, an approach for polyglot speech synthesis based on cross-lingual frame selection is proposed. This method requires only mono-lingual speech data of different speakers in different languages for building a polyglot synthesis system, thus reducing the burden of data collection. Essentially, a set of artificial utterances in the second language for a target speaker is constructed based on the proposed cross-lingual frame-selection process, and this data set is used to adapt a synthesis model in the second language to the speaker. In the cross-lingual frame-selection process, we propose to use auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system is implemented where the target speaker only speaks Mandarin. The results show that decent performance regarding voice identity and speech quality can be achieved with the proposed method.
doi_str_mv 10.1109/TASLP.2014.2339738
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2014-10, Vol.22 (10), p.1558-1570
issn 2329-9290
2329-9304
language eng
recordid cdi_proquest_journals_1564751919
source IEEE Electronic Library (IEL)
subjects Adaptation models
Articulatory features
auditory features
cross-lingual frame selection
Feature extraction
Hidden Markov models
IEEE transactions
polyglot speech synthesis
Speech
Speech synthesis
title Polyglot Speech Synthesis Based on Cross-Lingual Frame Selection Using Auditory and Articulatory Features
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T05%3A54%3A04IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Polyglot%20Speech%20Synthesis%20Based%20on%20Cross-Lingual%20Frame%20Selection%20Using%20Auditory%20and%20Articulatory%20Features&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Chen,%20Chia-Ping&rft.date=2014-10&rft.volume=22&rft.issue=10&rft.spage=1558&rft.epage=1570&rft.pages=1558-1570&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASD8&rft_id=info:doi/10.1109/TASLP.2014.2339738&rft_dat=%3Cproquest_RIE%3E3442342291%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1564751919&rft_id=info:pmid/&rft_ieee_id=6857339&rfr_iscdi=true