Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis

This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the...

Bibliographic Details
Published in:	arXiv.org 2021-11
Main Authors:	Vioni, Alexandra; Christidou, Myrsini; Ellinas, Nikolaos; Vamvoukakis, Georgios; Kakoulidis, Panos; Kim, Taehoon; June Sig Sung; Park, Hyoungmin; Chalamandaris, Aimilios; Tsiakoulis, Pirros
Format:	Article
Language:	eng
Subjects:	Centroids; Clustering; Coders; Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Sound; Control methods; Feature extraction; Linguistics; Phonemes; Speech recognition
Online Access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Vioni, Alexandra; Christidou, Myrsini; Ellinas, Nikolaos; Vamvoukakis, Georgios; Kakoulidis, Panos; Kim, Taehoon; June Sig Sung; Park, Hyoungmin; Chalamandaris, Aimilios; Tsiakoulis, Pirros
description	This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
doi_str_mv	10.48550/arxiv.2111.10177
format	Article
fullrecord	<record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2111_10177</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2600525531</sourcerecordid><originalsourceid>FETCH-LOGICAL-a521-cca170996b352e5b45258d3b8b0d03ec6123d09b4d9564ac5965070455e91d4b3</originalsourceid><addsrcrecordid>eNotkEtrwkAUhYdCoWL9AV11oOuxdx43j2UJthWECroPSeZaI3HGzkSp_75qujqbj8P5DmNPEqYmQ4TXKvy2p6mSUk4lyDS9YyOltRSZUeqBTWLcAYBKUoWoR2y9DD562za86I6xp9C6b77xgS-33tGeREcn6vhAnXnhXR98x1vHZ86K3otL8NWBqNny1dn1W4ptfGT3m6qLNPnPMVu_z9bFp1h8fcyLt4WoUEnRNJVMIc-TWqMirA0qzKyusxosaGoSqbSFvDY2x8RUDeYJQgoGkXJpTa3H7HmovSmXh9Duq3Aur-rlTf1CvAzEIfifI8W-3PljcJdNpUoA8PqB1H-KJlrh</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2600525531</pqid></control><display><type>article</type><title>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</title><source>Freely Accessible Journals</source><source>arXiv.org</source><creator>Vioni, Alexandra ; Christidou, Myrsini ; Ellinas, Nikolaos ; Vamvoukakis, Georgios ; Kakoulidis, Panos ; Kim, Taehoon ; June Sig Sung ; Park, Hyoungmin ; Chalamandaris, Aimilios ; Tsiakoulis, Pirros</creator><creatorcontrib>Vioni, Alexandra ; Christidou, Myrsini ; Ellinas, Nikolaos ; Vamvoukakis, Georgios ; Kakoulidis, Panos ; Kim, Taehoon ; June Sig Sung ; Park, Hyoungmin ; Chalamandaris, Aimilios ; Tsiakoulis, Pirros</creatorcontrib><description>This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. 
This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2111.10177</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Centroids ; Clustering ; Coders ; Computer Science - Computation and Language ; Computer Science - Learning ; Computer Science - Sound ; Control methods ; Feature extraction ; Linguistics ; Phonemes ; Speech recognition</subject><ispartof>arXiv.org, 2021-11</ispartof><rights>2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,782,786,887,27932</link.rule.ids><backlink>$$Uhttps://doi.org/10.1109/ICASSP39728.2021.9413604$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2111.10177$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Vioni, Alexandra</creatorcontrib><creatorcontrib>Christidou, Myrsini</creatorcontrib><creatorcontrib>Ellinas, Nikolaos</creatorcontrib><creatorcontrib>Vamvoukakis, Georgios</creatorcontrib><creatorcontrib>Kakoulidis, Panos</creatorcontrib><creatorcontrib>Kim, Taehoon</creatorcontrib><creatorcontrib>June Sig Sung</creatorcontrib><creatorcontrib>Park, Hyoungmin</creatorcontrib><creatorcontrib>Chalamandaris, Aimilios</creatorcontrib><creatorcontrib>Tsiakoulis, Pirros</creatorcontrib><title>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</title><title>arXiv.org</title><description>This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. 
This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.</description><subject>Centroids</subject><subject>Clustering</subject><subject>Coders</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Learning</subject><subject>Computer Science - Sound</subject><subject>Control methods</subject><subject>Feature extraction</subject><subject>Linguistics</subject><subject>Phonemes</subject><subject>Speech recognition</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotkEtrwkAUhYdCoWL9AV11oOuxdx43j2UJthWECroPSeZaI3HGzkSp_75qujqbj8P5DmNPEqYmQ4TXKvy2p6mSUk4lyDS9YyOltRSZUeqBTWLcAYBKUoWoR2y9DD562za86I6xp9C6b77xgS-33tGeREcn6vhAnXnhXR98x1vHZ86K3otL8NWBqNny1dn1W4ptfGT3m6qLNPnPMVu_z9bFp1h8fcyLt4WoUEnRNJVMIc-TWqMirA0qzKyusxosaGoSqbSFvDY2x8RUDeYJQgoGkXJpTa3H7HmovSmXh9Duq3Aur-rlTf1CvAzEIfifI8W-3PljcJdNpUoA8PqB1H-KJlrh</recordid><startdate>20211119</startdate><enddate>20211119</enddate><creator>Vioni, Alexandra</creator><creator>Christidou, Myrsini</creator><creator>Ellinas, Nikolaos</creator><creator>Vamvoukakis, Georgios</creator><creator>Kakoulidis, Panos</creator><creator>Kim, Taehoon</creator><creator>June Sig Sung</creator><creator>Park, Hyoungmin</creator><creator>Chalamandaris, Aimilios</creator><creator>Tsiakoulis, 
Pirros</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20211119</creationdate><title>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</title><author>Vioni, Alexandra ; Christidou, Myrsini ; Ellinas, Nikolaos ; Vamvoukakis, Georgios ; Kakoulidis, Panos ; Kim, Taehoon ; June Sig Sung ; Park, Hyoungmin ; Chalamandaris, Aimilios ; Tsiakoulis, Pirros</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a521-cca170996b352e5b45258d3b8b0d03ec6123d09b4d9564ac5965070455e91d4b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Centroids</topic><topic>Clustering</topic><topic>Coders</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Learning</topic><topic>Computer Science - Sound</topic><topic>Control methods</topic><topic>Feature extraction</topic><topic>Linguistics</topic><topic>Phonemes</topic><topic>Speech recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Vioni, Alexandra</creatorcontrib><creatorcontrib>Christidou, Myrsini</creatorcontrib><creatorcontrib>Ellinas, Nikolaos</creatorcontrib><creatorcontrib>Vamvoukakis, Georgios</creatorcontrib><creatorcontrib>Kakoulidis, Panos</creatorcontrib><creatorcontrib>Kim, Taehoon</creatorcontrib><creatorcontrib>June Sig Sung</creatorcontrib><creatorcontrib>Park, Hyoungmin</creatorcontrib><creatorcontrib>Chalamandaris, Aimilios</creatorcontrib><creatorcontrib>Tsiakoulis, 
Pirros</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Vioni, Alexandra</au><au>Christidou, Myrsini</au><au>Ellinas, Nikolaos</au><au>Vamvoukakis, Georgios</au><au>Kakoulidis, Panos</au><au>Kim, Taehoon</au><au>June Sig Sung</au><au>Park, Hyoungmin</au><au>Chalamandaris, Aimilios</au><au>Tsiakoulis, Pirros</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</atitle><jtitle>arXiv.org</jtitle><date>2021-11-19</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. 
Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2111.10177</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2021-11
issn	2331-8422
language	eng
recordid	cdi_arxiv_primary_2111_10177
source	Freely Accessible Journals; arXiv.org
subjects	Centroids; Clustering; Coders; Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Sound; Control methods; Feature extraction; Linguistics; Phonemes; Speech recognition
title	Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-04T07%3A12%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Prosodic%20Clustering%20for%20Phoneme-level%20Prosody%20Control%20in%20End-to-End%20Speech%20Synthesis&rft.jtitle=arXiv.org&rft.au=Vioni,%20Alexandra&rft.date=2021-11-19&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2111.10177&rft_dat=%3Cproquest_arxiv%3E2600525531%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2600525531&rft_id=info:pmid/&rfr_iscdi=true
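
The description above says that each prosodic feature is discretized with unsupervised clustering and that the F0 cluster centroids can be replaced with musical notes. The following is a minimal sketch of that idea, not the authors' implementation. It assumes plain 1-D k-means as the clustering method (the record does not name one) and assumes that "replacing centroids with notes" can be illustrated by snapping each centroid to the nearest equal-tempered note with A4 = 440 Hz. The function names `kmeans_1d`, `to_labels`, and `nearest_note` are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def kmeans_1d(values, k, iters=50):
    """Cluster scalar values (e.g. per-phoneme mean F0 in Hz) into k groups.

    Assumption: plain Lloyd-style k-means with quantile initialization;
    the source only says "unsupervised clustering".
    """
    ordered = sorted(values)
    n = len(ordered)
    # Start each centroid at an evenly spaced quantile of the data.
    centroids = [ordered[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            buckets[nearest].append(v)
        # An empty bucket keeps its previous centroid.
        updated = [sum(b) / len(b) if b else centroids[j] for j, b in enumerate(buckets)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def to_labels(values, centroids):
    """Map each value to the index of its nearest centroid (the prosodic label)."""
    return [min(range(len(centroids)), key=lambda j: abs(v - centroids[j])) for v in values]


def nearest_note(f0_hz):
    """Return (note name, frequency in Hz) of the equal-tempered note nearest to f0_hz.

    Assumption: 12-tone equal temperament, A4 = 440 Hz, MIDI note 69 = A4.
    """
    midi = round(69 + 12 * math.log2(f0_hz / 440.0))
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    freq = 440.0 * 2 ** ((midi - 69) / 12)
    return name, freq
```

In use, one would call `kmeans_1d` on the phoneme-level F0 values of the training set, feed the output of `to_labels` to the model as the prosodic label sequence, and optionally pass each centroid through `nearest_note` to obtain note-aligned targets. Duration could be handled the same way with its own set of centroids.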