Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis

This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the...

Detailed description

Bibliographic details
Published in: arXiv.org, 2021-11
Main authors: Vioni, Alexandra, Christidou, Myrsini, Ellinas, Nikolaos, Vamvoukakis, Georgios, Kakoulidis, Panos, Kim, Taehoon, June Sig Sung, Park, Hyoungmin, Chalamandaris, Aimilios, Tsiakoulis, Pirros
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Vioni, Alexandra
Christidou, Myrsini
Ellinas, Nikolaos
Vamvoukakis, Georgios
Kakoulidis, Panos
Kim, Taehoon
June Sig Sung
Park, Hyoungmin
Chalamandaris, Aimilios
Tsiakoulis, Pirros
description This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
doi_str_mv 10.48550/arxiv.2111.10177
format Article
fullrecord <record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2111_10177</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2600525531</sourcerecordid><originalsourceid>FETCH-LOGICAL-a521-cca170996b352e5b45258d3b8b0d03ec6123d09b4d9564ac5965070455e91d4b3</originalsourceid><addsrcrecordid>eNotkEtrwkAUhYdCoWL9AV11oOuxdx43j2UJthWECroPSeZaI3HGzkSp_75qujqbj8P5DmNPEqYmQ4TXKvy2p6mSUk4lyDS9YyOltRSZUeqBTWLcAYBKUoWoR2y9DD562za86I6xp9C6b77xgS-33tGeREcn6vhAnXnhXR98x1vHZ86K3otL8NWBqNny1dn1W4ptfGT3m6qLNPnPMVu_z9bFp1h8fcyLt4WoUEnRNJVMIc-TWqMirA0qzKyusxosaGoSqbSFvDY2x8RUDeYJQgoGkXJpTa3H7HmovSmXh9Duq3Aur-rlTf1CvAzEIfifI8W-3PljcJdNpUoA8PqB1H-KJlrh</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2600525531</pqid></control><display><type>article</type><title>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</title><source>Freely Accessible Journals</source><source>arXiv.org</source><creator>Vioni, Alexandra ; Christidou, Myrsini ; Ellinas, Nikolaos ; Vamvoukakis, Georgios ; Kakoulidis, Panos ; Kim, Taehoon ; June Sig Sung ; Park, Hyoungmin ; Chalamandaris, Aimilios ; Tsiakoulis, Pirros</creator><creatorcontrib>Vioni, Alexandra ; Christidou, Myrsini ; Ellinas, Nikolaos ; Vamvoukakis, Georgios ; Kakoulidis, Panos ; Kim, Taehoon ; June Sig Sung ; Park, Hyoungmin ; Chalamandaris, Aimilios ; Tsiakoulis, Pirros</creatorcontrib><description>This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. 
This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2111.10177</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Centroids ; Clustering ; Coders ; Computer Science - Computation and Language ; Computer Science - Learning ; Computer Science - Sound ; Control methods ; Feature extraction ; Linguistics ; Phonemes ; Speech recognition</subject><ispartof>arXiv.org, 2021-11</ispartof><rights>2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,782,786,887,27932</link.rule.ids><backlink>$$Uhttps://doi.org/10.1109/ICASSP39728.2021.9413604$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2111.10177$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Vioni, Alexandra</creatorcontrib><creatorcontrib>Christidou, Myrsini</creatorcontrib><creatorcontrib>Ellinas, Nikolaos</creatorcontrib><creatorcontrib>Vamvoukakis, Georgios</creatorcontrib><creatorcontrib>Kakoulidis, Panos</creatorcontrib><creatorcontrib>Kim, Taehoon</creatorcontrib><creatorcontrib>June Sig Sung</creatorcontrib><creatorcontrib>Park, Hyoungmin</creatorcontrib><creatorcontrib>Chalamandaris, Aimilios</creatorcontrib><creatorcontrib>Tsiakoulis, Pirros</creatorcontrib><title>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</title><title>arXiv.org</title><description>This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. 
This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.</description><subject>Centroids</subject><subject>Clustering</subject><subject>Coders</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Learning</subject><subject>Computer Science - Sound</subject><subject>Control methods</subject><subject>Feature extraction</subject><subject>Linguistics</subject><subject>Phonemes</subject><subject>Speech recognition</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotkEtrwkAUhYdCoWL9AV11oOuxdx43j2UJthWECroPSeZaI3HGzkSp_75qujqbj8P5DmNPEqYmQ4TXKvy2p6mSUk4lyDS9YyOltRSZUeqBTWLcAYBKUoWoR2y9DD562za86I6xp9C6b77xgS-33tGeREcn6vhAnXnhXR98x1vHZ86K3otL8NWBqNny1dn1W4ptfGT3m6qLNPnPMVu_z9bFp1h8fcyLt4WoUEnRNJVMIc-TWqMirA0qzKyusxosaGoSqbSFvDY2x8RUDeYJQgoGkXJpTa3H7HmovSmXh9Duq3Aur-rlTf1CvAzEIfifI8W-3PljcJdNpUoA8PqB1H-KJlrh</recordid><startdate>20211119</startdate><enddate>20211119</enddate><creator>Vioni, Alexandra</creator><creator>Christidou, Myrsini</creator><creator>Ellinas, Nikolaos</creator><creator>Vamvoukakis, Georgios</creator><creator>Kakoulidis, Panos</creator><creator>Kim, Taehoon</creator><creator>June Sig Sung</creator><creator>Park, Hyoungmin</creator><creator>Chalamandaris, Aimilios</creator><creator>Tsiakoulis, 
Pirros</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20211119</creationdate><title>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</title><author>Vioni, Alexandra ; Christidou, Myrsini ; Ellinas, Nikolaos ; Vamvoukakis, Georgios ; Kakoulidis, Panos ; Kim, Taehoon ; June Sig Sung ; Park, Hyoungmin ; Chalamandaris, Aimilios ; Tsiakoulis, Pirros</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a521-cca170996b352e5b45258d3b8b0d03ec6123d09b4d9564ac5965070455e91d4b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Centroids</topic><topic>Clustering</topic><topic>Coders</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Learning</topic><topic>Computer Science - Sound</topic><topic>Control methods</topic><topic>Feature extraction</topic><topic>Linguistics</topic><topic>Phonemes</topic><topic>Speech recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Vioni, Alexandra</creatorcontrib><creatorcontrib>Christidou, Myrsini</creatorcontrib><creatorcontrib>Ellinas, Nikolaos</creatorcontrib><creatorcontrib>Vamvoukakis, Georgios</creatorcontrib><creatorcontrib>Kakoulidis, Panos</creatorcontrib><creatorcontrib>Kim, Taehoon</creatorcontrib><creatorcontrib>June Sig Sung</creatorcontrib><creatorcontrib>Park, Hyoungmin</creatorcontrib><creatorcontrib>Chalamandaris, Aimilios</creatorcontrib><creatorcontrib>Tsiakoulis, 
Pirros</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Vioni, Alexandra</au><au>Christidou, Myrsini</au><au>Ellinas, Nikolaos</au><au>Vamvoukakis, Georgios</au><au>Kakoulidis, Panos</au><au>Kim, Taehoon</au><au>June Sig Sung</au><au>Park, Hyoungmin</au><au>Chalamandaris, Aimilios</au><au>Tsiakoulis, Pirros</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis</atitle><jtitle>arXiv.org</jtitle><date>2021-11-19</date><risdate>2021</risdate><eissn>2331-8422</eissn><abstract>This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. 
Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2111.10177</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2021-11
issn 2331-8422
language eng
recordid cdi_arxiv_primary_2111_10177
source Freely Accessible Journals; arXiv.org
subjects Centroids
Clustering
Coders
Computer Science - Computation and Language
Computer Science - Learning
Computer Science - Sound
Control methods
Feature extraction
Linguistics
Phonemes
Speech recognition
title Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-04T07%3A12%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Prosodic%20Clustering%20for%20Phoneme-level%20Prosody%20Control%20in%20End-to-End%20Speech%20Synthesis&rft.jtitle=arXiv.org&rft.au=Vioni,%20Alexandra&rft.date=2021-11-19&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2111.10177&rft_dat=%3Cproquest_arxiv%3E2600525531%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2600525531&rft_id=info:pmid/&rfr_iscdi=true
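
Illustrative sketch (not part of the catalog record, and not the authors' implementation): the description field says that phoneme-level F0 values are discretized by unsupervised clustering into prosodic labels, and that the F0 cluster centroids can be replaced with musical notes. The Python below shows one way that could look. The use of k-means, the helper names `kmeans_1d`, `assign` and `nearest_note`, the equal-tempered tuning with A4 = 440 Hz, and the sample F0 values are all assumptions made for illustration; the record does not name the clustering algorithm or the note-mapping rule.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def assign(values, centroids):
    """Return, for each value, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
        for v in values
    ]


def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means (assumed algorithm); returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the centroids over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        labels = assign(values, centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


def nearest_note(f0_hz):
    """Snap a frequency to the nearest equal-tempered note (A4 = 440 Hz assumed)."""
    midi = round(69 + 12 * math.log2(f0_hz / 440.0))
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    return name, 440.0 * 2 ** ((midi - 69) / 12)


# Hypothetical per-phoneme mean F0 values in Hz for one utterance.
f0_per_phoneme = [110.0, 112.0, 108.0, 220.0, 222.0, 218.0]
centroids, f0_labels = kmeans_1d(f0_per_phoneme, k=2)
note_centroids = [nearest_note(c) for c in centroids]
```

In this sketch `f0_labels` plays the role of the prosodic label sequence that runs in parallel to the phoneme sequence, and `note_centroids` is the note-valued replacement for the learned centroids. Duration could be discretized the same way with a second call to `kmeans_1d`.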