Multi-Voice Singing Synthesis From Lyrics

In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics to their sung version in the target speaker’s voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. Synthesized speech is g...

Bibliographic Details
Published in:	Circuits, systems, and signal processing, 2023, Vol.42 (1), p.307-321
Main Authors:	Resna, S. ; Rajan, Rajeev
Format:	Article
Language:	eng
Subjects:	Acoustics ; Annotations ; Circuits and Systems ; Coders ; Electrical Engineering ; Electronics and Microelectronics ; Engineering ; Generative adversarial networks ; Instrumentation ; Intelligibility ; Lyrics ; Modules ; Multilingualism ; Phonemes ; Phonetics ; Signal processing ; Signal,Image and Speech Processing ; Singers ; Singing ; Speech ; Speech recognition ; Speech synthesis ; Synthesis ; Text-to-speech
Online Access:	Full text

container_end_page	321
container_issue	1
container_start_page	307
container_title	Circuits, systems, and signal processing
container_volume	42
creator	Resna, S. ; Rajan, Rajeev
description	In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics to their sung version in the target speaker’s voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. Synthesized speech is generated from lyrics for a target speaker’s voice by a TTS converter in the front end. Later, a sung version is synthesized in target melody through an encoder–decoder model in the STS module. Further, phonetic intelligibility is enhanced using an intelligibility enhancement module based on an audio style transfer scheme. The proposed system is systematically evaluated using LibriSpeech and NUS-48E corpus using subjective and objective evaluation. We have compared our model with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). Our study shows that the proposed model performs on par with the baseline model without any phoneme annotations.
doi_str_mv	10.1007/s00034-022-02122-3
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2760705090</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2760705090</sourcerecordid><originalsourceid>FETCH-LOGICAL-c319t-163983840ddc7b2e67bbece74cdcb367bc3f07e4157ea5b024a7b41f2b547f63</originalsourceid><addsrcrecordid>eNp9kEtLxDAUhYMoWEf_gKuCKxfRm1eTLmVwRqHiYgZxF5o0HTPMtGPSLubfG63gTrgPLpxzLnwIXRO4IwDyPgIA4xgoTU3SZCcoI4IRLJRUpygDKhUGRd7P0UWMWwBS8pJm6PZl3A0ev_Xeunzlu02qfHXshg8XfcwXod_n1TF4Gy_RWVvvorv63TO0Xjyu50-4el0-zx8qbBkpB0wKViqmODSNlYa6QhrjrJPcNtawdFnWgnScCOlqYYDyWhpOWmoEl23BZuhmij2E_nN0cdDbfgxd-qipLECCgBKSik4qG_oYg2v1Ifh9HY6agP4moiciOhHRP0Q0SyY2mWISdxsX_qL_cX0BP4FiIQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2760705090</pqid></control><display><type>article</type><title>Multi-Voice Singing Synthesis From Lyrics</title><source>SpringerNature Journals</source><creator>Resna, S. ; Rajan, Rajeev</creator><creatorcontrib>Resna, S. ; Rajan, Rajeev</creatorcontrib><description>In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics to their sung version in the target speaker’s voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. Synthesized speech is generated from lyrics for a target speaker’s voice by a TTS converter in the front end. Later, a sung version is synthesized in target melody through an encoder–decoder model in the STS module. Further, phonetic intelligibility is enhanced using an intelligibility enhancement module based on an audio style transfer scheme. The proposed system is systematically evaluated using LibriSpeech and NUS-48E corpus using subjective and objective evaluation. We have compared our model with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). 
Our study shows that the proposed model performs on par with the baseline model without any phoneme annotations.</description><identifier>ISSN: 0278-081X</identifier><identifier>EISSN: 1531-5878</identifier><identifier>DOI: 10.1007/s00034-022-02122-3</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Acoustics ; Annotations ; Circuits and Systems ; Coders ; Electrical Engineering ; Electronics and Microelectronics ; Engineering ; Generative adversarial networks ; Instrumentation ; Intelligibility ; Lyrics ; Modules ; Multilingualism ; Phonemes ; Phonetics ; Signal processing ; Signal,Image and Speech Processing ; Singers ; Singing ; Speech ; Speech recognition ; Speech synthesis ; Synthesis ; Text-to-speech</subject><ispartof>Circuits, systems, and signal processing, 2023, Vol.42 (1), p.307-321</ispartof><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022. Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable 
law.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c319t-163983840ddc7b2e67bbece74cdcb367bc3f07e4157ea5b024a7b41f2b547f63</citedby><cites>FETCH-LOGICAL-c319t-163983840ddc7b2e67bbece74cdcb367bc3f07e4157ea5b024a7b41f2b547f63</cites><orcidid>0000-0001-5488-9026</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s00034-022-02122-3$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s00034-022-02122-3$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,780,784,27924,27925,41488,42557,51319</link.rule.ids></links><search><creatorcontrib>Resna, S.</creatorcontrib><creatorcontrib>Rajan, Rajeev</creatorcontrib><title>Multi-Voice Singing Synthesis From Lyrics</title><title>Circuits, systems, and signal processing</title><addtitle>Circuits Syst Signal Process</addtitle><description>In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics to their sung version in the target speaker’s voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. Synthesized speech is generated from lyrics for a target speaker’s voice by a TTS converter in the front end. Later, a sung version is synthesized in target melody through an encoder–decoder model in the STS module. Further, phonetic intelligibility is enhanced using an intelligibility enhancement module based on an audio style transfer scheme. The proposed system is systematically evaluated using LibriSpeech and NUS-48E corpus using subjective and objective evaluation. We have compared our model with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). 
Our study shows that the proposed model performs on par with the baseline model without any phoneme annotations.</description><subject>Acoustics</subject><subject>Annotations</subject><subject>Circuits and Systems</subject><subject>Coders</subject><subject>Electrical Engineering</subject><subject>Electronics and Microelectronics</subject><subject>Engineering</subject><subject>Generative adversarial networks</subject><subject>Instrumentation</subject><subject>Intelligibility</subject><subject>Lyrics</subject><subject>Modules</subject><subject>Multilingualism</subject><subject>Phonemes</subject><subject>Phonetics</subject><subject>Signal processing</subject><subject>Signal,Image and Speech Processing</subject><subject>Singers</subject><subject>Singing</subject><subject>Speech</subject><subject>Speech recognition</subject><subject>Speech synthesis</subject><subject>Synthesis</subject><subject>Text-to-speech</subject><issn>0278-081X</issn><issn>1531-5878</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp9kEtLxDAUhYMoWEf_gKuCKxfRm1eTLmVwRqHiYgZxF5o0HTPMtGPSLubfG63gTrgPLpxzLnwIXRO4IwDyPgIA4xgoTU3SZCcoI4IRLJRUpygDKhUGRd7P0UWMWwBS8pJm6PZl3A0ev_Xeunzlu02qfHXshg8XfcwXod_n1TF4Gy_RWVvvorv63TO0Xjyu50-4el0-zx8qbBkpB0wKViqmODSNlYa6QhrjrJPcNtawdFnWgnScCOlqYYDyWhpOWmoEl23BZuhmij2E_nN0cdDbfgxd-qipLECCgBKSik4qG_oYg2v1Ifh9HY6agP4moiciOhHRP0Q0SyY2mWISdxsX_qL_cX0BP4FiIQ</recordid><startdate>2023</startdate><enddate>2023</enddate><creator>Resna, S.</creator><creator>Rajan, Rajeev</creator><general>Springer US</general><general>Springer Nature 
B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7SC</scope><scope>7SP</scope><scope>7T9</scope><scope>7XB</scope><scope>88I</scope><scope>8AL</scope><scope>8AO</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M0N</scope><scope>M2P</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>Q9U</scope><scope>S0W</scope><orcidid>https://orcid.org/0000-0001-5488-9026</orcidid></search><sort><creationdate>2023</creationdate><title>Multi-Voice Singing Synthesis From Lyrics</title><author>Resna, S. 
; Rajan, Rajeev</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c319t-163983840ddc7b2e67bbece74cdcb367bc3f07e4157ea5b024a7b41f2b547f63</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Acoustics</topic><topic>Annotations</topic><topic>Circuits and Systems</topic><topic>Coders</topic><topic>Electrical Engineering</topic><topic>Electronics and Microelectronics</topic><topic>Engineering</topic><topic>Generative adversarial networks</topic><topic>Instrumentation</topic><topic>Intelligibility</topic><topic>Lyrics</topic><topic>Modules</topic><topic>Multilingualism</topic><topic>Phonemes</topic><topic>Phonetics</topic><topic>Signal processing</topic><topic>Signal,Image and Speech Processing</topic><topic>Singers</topic><topic>Singing</topic><topic>Speech</topic><topic>Speech recognition</topic><topic>Speech synthesis</topic><topic>Synthesis</topic><topic>Text-to-speech</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Resna, S.</creatorcontrib><creatorcontrib>Rajan, Rajeev</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Science Database (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest Pharma Collection</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>Materials Science & Engineering 
Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Computing Database</collection><collection>Science Database</collection><collection>Engineering Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>ProQuest Central Basic</collection><collection>DELNET Engineering & Technology Collection</collection><jtitle>Circuits, systems, and signal processing</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Resna, S.</au><au>Rajan, Rajeev</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Multi-Voice Singing Synthesis From Lyrics</atitle><jtitle>Circuits, systems, and signal 
processing</jtitle><stitle>Circuits Syst Signal Process</stitle><date>2023</date><risdate>2023</risdate><volume>42</volume><issue>1</issue><spage>307</spage><epage>321</epage><pages>307-321</pages><issn>0278-081X</issn><eissn>1531-5878</eissn><abstract>In this paper, a multi-voice singing synthesis framework is proposed to convert lyrics to their sung version in the target speaker’s voice. It consists of three blocks: a text-to-speech (TTS) module, a speech-to-singing (STS) module, and an intelligibility enhancement module. Synthesized speech is generated from lyrics for a target speaker’s voice by a TTS converter in the front end. Later, a sung version is synthesized in target melody through an encoder–decoder model in the STS module. Further, phonetic intelligibility is enhanced using an intelligibility enhancement module based on an audio style transfer scheme. The proposed system is systematically evaluated using LibriSpeech and NUS-48E corpus using subjective and objective evaluation. We have compared our model with a state-of-the-art multi-voice singing synthesis model based on a generative adversarial network (GAN). Our study shows that the proposed model performs on par with the baseline model without any phoneme annotations.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s00034-022-02122-3</doi><tpages>15</tpages><orcidid>https://orcid.org/0000-0001-5488-9026</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 0278-081X
ispartof	Circuits, systems, and signal processing, 2023, Vol.42 (1), p.307-321
issn	0278-081X 1531-5878
language	eng
recordid	cdi_proquest_journals_2760705090
source	SpringerNature Journals
subjects	Acoustics ; Annotations ; Circuits and Systems ; Coders ; Electrical Engineering ; Electronics and Microelectronics ; Engineering ; Generative adversarial networks ; Instrumentation ; Intelligibility ; Lyrics ; Modules ; Multilingualism ; Phonemes ; Phonetics ; Signal processing ; Signal,Image and Speech Processing ; Singers ; Singing ; Speech ; Speech recognition ; Speech synthesis ; Synthesis ; Text-to-speech
title	Multi-Voice Singing Synthesis From Lyrics
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T19%3A37%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Multi-Voice%20Singing%20Synthesis%20From%20Lyrics&rft.jtitle=Circuits,%20systems,%20and%20signal%20processing&rft.au=Resna,%20S.&rft.date=2023&rft.volume=42&rft.issue=1&rft.spage=307&rft.epage=321&rft.pages=307-321&rft.issn=0278-081X&rft.eissn=1531-5878&rft_id=info:doi/10.1007/s00034-022-02122-3&rft_dat=%3Cproquest_cross%3E2760705090%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2760705090&rft_id=info:pmid/&rfr_iscdi=true