Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis
This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the...
Published in: | arXiv.org 2021-11 |
---|---|
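Main authors: | Vioni, Alexandra; Christidou, Myrsini; Ellinas, Nikolaos; Vamvoukakis, Georgios; Kakoulidis, Panos; Kim, Taehoon; June Sig Sung; Park, Hyoungmin; Chalamandaris, Aimilios; Tsiakoulis, Pirros |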
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Vioni, Alexandra; Christidou, Myrsini; Ellinas, Nikolaos; Vamvoukakis, Georgios; Kakoulidis, Panos; Kim, Taehoon; June Sig Sung; Park, Hyoungmin; Chalamandaris, Aimilios; Tsiakoulis, Pirros |
description | This paper presents a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. Instead of learning latent prosodic features with a variational framework as is commonly done, we directly extract phoneme-level F0 and duration features from the speech data in the training set. Each prosodic feature is discretized using unsupervised clustering in order to produce a sequence of prosodic labels for each utterance. This sequence is used in parallel to the phoneme sequence in order to condition the decoder with the utilization of a prosodic encoder and a corresponding attention module. Experimental results show that the proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker. |
doi_str_mv | 10.48550/arxiv.2111.10177 |
format | Article |
creationdate | 2021-11-19 |
publisher | Ithaca: Cornell University Library, arXiv.org |
rights | 2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
oa | free_for_read |
backlink | https://doi.org/10.1109/ICASSP39728.2021.9413604 (published paper; access to full text may be restricted); https://doi.org/10.48550/arXiv.2111.10177 (paper in arXiv) |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2021-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_arxiv_primary_2111_10177 |
source | Freely Accessible Journals; arXiv.org |
subjects | Centroids; Clustering; Coders; Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Sound; Control methods; Feature extraction; Linguistics; Phonemes; Speech recognition |
title | Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-04T07%3A12%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Prosodic%20Clustering%20for%20Phoneme-level%20Prosody%20Control%20in%20End-to-End%20Speech%20Synthesis&rft.jtitle=arXiv.org&rft.au=Vioni,%20Alexandra&rft.date=2021-11-19&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2111.10177&rft_dat=%3Cproquest_arxiv%3E2600525531%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2600525531&rft_id=info:pmid/&rfr_iscdi=true |
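
The description field above says that each phoneme-level prosodic feature is discretized with unsupervised clustering, and that the F0 cluster centroids can be replaced with musical notes. The following is a minimal sketch of that idea, not the authors' implementation. It assumes one-dimensional k-means as the clustering algorithm (the record does not name one), equal temperament with A4 = 440 Hz for the note mapping, and made-up F0 values; the function names `kmeans_1d`, `to_labels`, and `nearest_note_hz` are hypothetical.

```python
import math


def kmeans_1d(values, k, iters=50):
    """Plain 1-D k-means (Lloyd's algorithm); returns centroids in ascending order."""
    ordered = sorted(values)
    # Assumption: initialise centroids at evenly spaced quantiles of the data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            buckets[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        updated = [sum(b) / len(b) if b else centroids[i]
                   for i, b in enumerate(buckets)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def to_labels(values, centroids):
    """Map each value to the index of its nearest centroid (its prosodic label)."""
    return [min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values]


def nearest_note_hz(f0_hz, a4_hz=440.0):
    """Snap a frequency to the nearest equal-tempered note (assumes A4 = 440 Hz)."""
    semitones = round(12 * math.log2(f0_hz / a4_hz))
    return a4_hz * 2 ** (semitones / 12)


# Hypothetical per-phoneme mean F0 values in Hz, for illustration only.
f0 = [110.0, 112.0, 108.0, 180.0, 178.0, 182.0, 240.0, 238.0, 242.0]

centroids = kmeans_1d(f0, k=3)                            # one centroid per F0 cluster
labels = to_labels(f0, centroids)                         # label sequence for the utterance
note_centroids = [nearest_note_hz(c) for c in centroids]  # centroids snapped to notes
```

The same two functions could be applied to per-phoneme durations to obtain a second label sequence. Sorting the centroids keeps the label indices ordered from low to high, so that a higher label always means a higher F0 in this sketch.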