Polyglot Speech Synthesis Based on Cross-Lingual Frame Selection Using Auditory and Articulatory Features

In this paper, an approach for polyglot speech synthesis based on cross-lingual frame selection is proposed. This method requires only mono-lingual speech data of different speakers in different languages for building a polyglot synthesis system, thus reducing the burden of data collection. Essentially, a set of artificial utterances in the second language for a target speaker is constructed based on the proposed cross-lingual frame-selection process, and this data set is used to adapt a synthesis model in the second language to the speaker. In the cross-lingual frame-selection process, we propose to use auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system is implemented where the target speaker only speaks Mandarin. The results show that decent performance regarding voice identity and speech quality can be achieved with the proposed method.

Detailed Description

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014-10, Vol. 22 (10), p. 1558-1570
Main authors: Chen, Chia-Ping, Huang, Yi-Chin, Wu, Chung-Hsien, Lee, Kuan-De
Format: Article
Language: English
Subjects:
Online access: Order full text
container_end_page 1570
container_issue 10
container_start_page 1558
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 22
creator Chen, Chia-Ping ; Huang, Yi-Chin ; Wu, Chung-Hsien ; Lee, Kuan-De
description In this paper, an approach for polyglot speech synthesis based on cross-lingual frame selection is proposed. This method requires only mono-lingual speech data of different speakers in different languages for building a polyglot synthesis system, thus reducing the burden of data collection. Essentially, a set of artificial utterances in the second language for a target speaker is constructed based on the proposed cross-lingual frame-selection process, and this data set is used to adapt a synthesis model in the second language to the speaker. In the cross-lingual frame-selection process, we propose to use auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system is implemented where the target speaker only speaks Mandarin. The results show that decent performance regarding voice identity and speech quality can be achieved with the proposed method.
doi_str_mv 10.1109/TASLP.2014.2339738
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2014-10, Vol.22 (10), p.1558-1570
issn 2329-9290
2329-9304
language eng
recordid cdi_proquest_journals_1564751919
source IEEE Electronic Library (IEL)
subjects Adaptation models
Articulatory features
auditory features
cross-lingual frame selection
Feature extraction
Hidden Markov models
IEEE transactions
polyglot speech synthesis
Speech
Speech synthesis
title Polyglot Speech Synthesis Based on Cross-Lingual Frame Selection Using Auditory and Articulatory Features
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T05%3A54%3A04IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Polyglot%20Speech%20Synthesis%20Based%20on%20Cross-Lingual%20Frame%20Selection%20Using%20Auditory%20and%20Articulatory%20Features&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Chen,%20Chia-Ping&rft.date=2014-10&rft.volume=22&rft.issue=10&rft.spage=1558&rft.epage=1570&rft.pages=1558-1570&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASD8&rft_id=info:doi/10.1109/TASLP.2014.2339738&rft_dat=%3Cproquest_RIE%3E3442342291%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1564751919&rft_id=info:pmid/&rft_ieee_id=6857339&rfr_iscdi=true