A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis

This paper explores predicting suitable prosodic features for fine-grained emotion analysis from discourse-level text. To obtain fine-grained emotional prosodic features as prediction targets for our model, we extract a phoneme-level Local Prosody Embedding sequence (LPEs) and a Global Style Embedding as prosodic speech features from speech with the help of a style transfer model. We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features. The proposed model can be used to analyze emotional prosodic features more effectively and thus guide a speech synthesis model to synthesize more expressive speech. To quantitatively evaluate the proposed model, we contribute a new, large-scale Discourse-level Chinese Audiobook (DCA) dataset with more than 13,000 annotated utterance sequences. Experimental results on the DCA dataset show that multi-scale text information effectively helps to predict prosodic features, and that discourse-level text improves both overall coherence and user experience. More interestingly, although we aim to match the synthesis effect of the style transfer model, speech synthesized with the proposed text prosodic analysis model even surpasses style transfer from the original speech on some user evaluation indicators.
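The abstract names two prediction targets: a per-phoneme Local Prosody Embedding (LPE) sequence and a single Global Style Embedding, both predicted from multi-scale text. The paper's actual D-MPM architecture is not given in this record, so the following is only a toy sketch with made-up function names, dimensions, and random projections, meant to illustrate the input/output shapes implied by "phoneme-level sequence plus one global vector", not the authors' method:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_prosody(phoneme_feats, utterance_feat, discourse_feat,
                    lpe_dim=64, gse_dim=128):
    """Hypothetical multi-scale fusion sketch.

    phoneme_feats: (T, d_p), one row per phoneme;
    utterance_feat: (d_u,); discourse_feat: (d_d,).
    Returns (lpes, gse): a (T, lpe_dim) LPE sequence and a
    (gse_dim,) global style vector.
    """
    T, _ = phoneme_feats.shape
    # Broadcast the coarser scales (utterance, discourse) to every
    # phoneme position and fuse with the phoneme-level features.
    context = np.concatenate(
        [np.tile(utterance_feat, (T, 1)), np.tile(discourse_feat, (T, 1))],
        axis=1)
    fused = np.concatenate([phoneme_feats, context], axis=1)
    # Random linear heads stand in for the (unspecified) predictors.
    W_lpe = rng.standard_normal((fused.shape[1], lpe_dim))
    lpes = fused @ W_lpe                    # per-phoneme local prosody
    W_gse = rng.standard_normal((fused.shape[1], gse_dim))
    gse = fused.mean(axis=0) @ W_gse        # pooled global style
    return lpes, gse

lpes, gse = predict_prosody(rng.standard_normal((20, 32)),
                            rng.standard_normal(16),
                            rng.standard_normal(16))
print(lpes.shape, gse.shape)  # (20, 64) (128,)
```

The only point the sketch carries over from the abstract is the output structure: one embedding per phoneme for local prosody, and one pooled vector for global style.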

Published in: arXiv.org, 2023-09
Authors: Xianhao Wei; Jia, Jia; Li, Xiang; Wu, Zhiyong; Wang, Ziyi
Format: Article
Language: English
Online access: Full text
EISSN: 2331-8422
Subjects: Datasets
Embedding
Emotions
Linguistics
Speech
Speech recognition
Synthesis
User experience