Document-level Neural TTS using Curriculum Learning and Attention Masking

Speech synthesis has advanced to the level of natural, human-level speech produced by attention-based end-to-end text-to-speech (TTS) models. However, it is difficult to generate correct attention alignments when synthesizing text longer than the training utterances, such as document-level text. In this paper, we propose a neural speech synthesis model that can synthesize more than 5 min of speech at once from training data comprising short speech clips of less than 10 s. This model can be used for tasks that need to synthesize document-level speech in one pass, such as a singing voice synthesis (SVS) system or a book-reading system. First, through curriculum learning, our model automatically increases the length of the speech trained in each epoch while reducing the batch size, so that long sentences can be trained within limited graphics processing unit (GPU) capacity. During synthesis, the document-level text is synthesized using only the contexts needed at the current time step, with the rest masked through an attention-masking mechanism. A Tacotron2-based speech synthesis model and a duration predictor were used in the experiments, and the results showed that the proposed method synthesizes document-level speech with substantially lower character and attention error rates, and higher quality, than the existing model.

Bibliographic details
Published in: IEEE Access, 2021-01, Vol. 9, p. 1-1
Main authors: Hwang, Sung-Woong; Chang, Joon-Hyuk
Format: Article
Language: English
Online access: Full text
container_end_page 1
container_issue
container_start_page 1
container_title IEEE access
container_volume 9
creator Hwang, Sung-Woong
Chang, Joon-Hyuk
description Speech synthesis has advanced to the level of natural, human-level speech produced by attention-based end-to-end text-to-speech (TTS) models. However, it is difficult to generate correct attention alignments when synthesizing text longer than the training utterances, such as document-level text. In this paper, we propose a neural speech synthesis model that can synthesize more than 5 min of speech at once from training data comprising short speech clips of less than 10 s. This model can be used for tasks that need to synthesize document-level speech in one pass, such as a singing voice synthesis (SVS) system or a book-reading system. First, through curriculum learning, our model automatically increases the length of the speech trained in each epoch while reducing the batch size, so that long sentences can be trained within limited graphics processing unit (GPU) capacity. During synthesis, the document-level text is synthesized using only the contexts needed at the current time step, with the rest masked through an attention-masking mechanism. A Tacotron2-based speech synthesis model and a duration predictor were used in the experiments, and the results showed that the proposed method synthesizes document-level speech with substantially lower character and attention error rates, and higher quality, than the existing model.
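The curriculum described above (growing the maximum utterance length each epoch while shrinking the batch size to fit GPU memory) can be sketched as follows. This is a minimal illustration, not the paper's actual schedule: the function names, the geometric growth rate, and the length-times-batch memory budget are all illustrative assumptions.

```python
def curriculum_schedule(epoch, base_len=2.0, growth=1.5,
                        max_batch=64, mem_budget=128.0):
    """Illustrative curriculum step: each epoch, grow the maximum
    utterance length geometrically and shrink the batch size so that
    length * batch stays under a fixed budget (a stand-in for GPU memory)."""
    max_len = base_len * growth ** epoch               # seconds of speech allowed
    batch = min(max_batch, max(1, int(mem_budget // max_len)))
    return max_len, batch

def epoch_subset(clips, max_len):
    """Select only the training clips short enough for this epoch."""
    return [c for c in clips if c["dur"] <= max_len]
```

Early epochs train many short clips per batch; as `max_len` grows, the batch shrinks so memory use stays roughly constant.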
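The attention-masking idea (attending only to the text context needed at the current step and masking the rest) can likewise be sketched. A fixed-width window around the current text position is an assumption here; the paper's mechanism for choosing the retained context may differ, and the function name and window size are illustrative.

```python
import numpy as np

def masked_attention(scores, center, width=5):
    """Mask attention scores outside a window of `width` positions around
    the current text position `center`, then apply a softmax, so the
    decoder assigns zero weight to distant document context."""
    masked = np.full_like(scores, -np.inf)
    lo = max(0, center - width)
    hi = min(len(scores), center + width + 1)
    masked[lo:hi] = scores[lo:hi]
    e = np.exp(masked - masked[lo:hi].max())   # subtract max for stability
    return e / e.sum()
```

Because masked positions are set to negative infinity before the softmax, their weights are exactly zero, which is what prevents attention from drifting over a long document.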
doi_str_mv 10.1109/ACCESS.2020.3049073
format Article
publisher Piscataway: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2021-01, Vol.9, p.1-1
issn 2169-3536
2169-3536
language eng
recordid cdi_ieee_primary_9312676
source IEEE Open Access Journals; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals
subjects attention masking
Context modeling
Curricula
curriculum learning
Data models
DeepVoice3
document-level neural TTS
Graphics processing units
Learning
Masking
MelGAN
MultiSpeech
ParaNet
Predictive models
Singing
Spectrogram
Speech
Speech recognition
Speech synthesis
Synthesis
Tacotron2
Training
title Document-level Neural TTS using Curriculum Learning and Attention Masking
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-12T08%3A41%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Document-level%20Neural%20TTS%20using%20Curriculum%20Learning%20and%20Attention%20Masking&rft.jtitle=IEEE%20access&rft.au=Hwang,%20Sung-Woong&rft.date=2021-01-01&rft.volume=9&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2020.3049073&rft_dat=%3Cproquest_ieee_%3E2478834008%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2478834008&rft_id=info:pmid/&rft_ieee_id=9312676&rft_doaj_id=oai_doaj_org_article_e5e9d71aa603485cab3e50756fab497d&rfr_iscdi=true