Speech Synthesis with Mixed Emotions
Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between speech samples of different emotions, and incorporate it into a sequence-to-sequence emotional text-to-speech framework.
Saved in:
Published in: | arXiv.org 2022-12 |
---|---|
Main authors: | Zhou, Kun; Sisman, Berrak; Rana, Rajib; Schuller, B W; Li, Haizhou |
Format: | Article |
Language: | eng |
Subjects: | Emotions; Evaluation; Mixtures; Speech recognition; Synthesis |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Zhou, Kun; Sisman, Berrak; Rana, Rajib; Schuller, B W; Li, Haizhou |
description | Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between speech samples of different emotions. We then incorporate this formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying their differences from other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. Objective and subjective evaluations validate the effectiveness of the proposed framework. To the best of our knowledge, this is the first study on modelling, synthesizing, and evaluating mixed emotions in speech. |
format | Article |
fullrecord | ProQuest record 2701333214; article dated 2022-12-28; EISSN 2331-8422; publisher: Cornell University Library, arXiv.org (Ithaca); published under http://creativecommons.org/licenses/by/4.0/ |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2701333214 |
source | Free E-Journals |
subjects | Emotions; Evaluation; Mixtures; Speech recognition; Synthesis |
title | Speech Synthesis with Mixed Emotions |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T21%3A48%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Speech%20Synthesis%20with%20Mixed%20Emotions&rft.jtitle=arXiv.org&rft.au=Zhou,%20Kun&rft.date=2022-12-28&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2701333214%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2701333214&rft_id=info:pmid/&rfr_iscdi=true |
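The abstract describes controlling the synthesized emotion mixture at run-time by manually defining an emotion attribute vector over emotion types. As a minimal sketch of that idea only (the record does not specify the authors' implementation; the per-emotion embeddings and the weighted-blend rule below are assumptions for illustration):

```python
import numpy as np

# Hypothetical pre-computed emotion style embeddings, one per emotion
# type; in a real system these would come from a trained emotion encoder.
EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise"]
rng = np.random.default_rng(0)
emotion_embeddings = {e: rng.normal(size=128) for e in EMOTIONS}

def mix_emotions(attribute_vector: dict) -> np.ndarray:
    """Blend emotion embeddings by a manually defined attribute vector.

    Weights are normalized to sum to 1, so the mixture stays on the same
    scale as a single-emotion embedding and only relative intensities matter.
    """
    total = sum(attribute_vector.values())
    if total <= 0:
        raise ValueError("attribute vector must have positive total weight")
    mixed = np.zeros(128)
    for emotion, weight in attribute_vector.items():
        mixed += (weight / total) * emotion_embeddings[emotion]
    return mixed

# e.g. 70% happy, 30% surprise as a stand-in for "pleasant surprise"
embedding = mix_emotions({"happy": 0.7, "surprise": 0.3})
```

The resulting vector would condition the text-to-speech decoder in place of a single-emotion embedding; because the weights are normalized, `{"happy": 7, "surprise": 3}` yields the same mixture as `{"happy": 0.7, "surprise": 0.3}`.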