CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis

To further improve the speaking styles of synthesized speeches, current text-to-speech (TTS) synthesis systems commonly employ reference speeches to stylize their outputs instead of just the input texts. These reference speeches are obtained by manual selection which is resource-consuming, or select...

Bibliographic details
Published in:	arXiv.org 2023-08
Main authors:	Meng, Yi; Li, Xiang; Wu, Zhiyong; Li, Tingtian; Sun, Zixun; Xiao, Xinyu; Sun, Chi; Zhan, Hui; Meng, Helen
Format:	Article
Language:	eng
Subjects:	Semantics; Similarity; Speaking; Speech recognition; Speeches; Synthesis
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Meng, Yi ; Li, Xiang ; Wu, Zhiyong ; Li, Tingtian ; Sun, Zixun ; Xiao, Xinyu ; Sun, Chi ; Zhan, Hui ; Meng, Helen
description	To further improve the speaking styles of synthesized speeches, current text-to-speech (TTS) synthesis systems commonly employ reference speeches to stylize their outputs instead of relying on the input texts alone. These reference speeches are either obtained by manual selection, which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information but also style-irrelevant information. The information in the text that is irrelevant to speaking style can interfere with the reference audio selection and result in improper speaking styles. To improve the reference selection, we propose the Contrastive Acoustic-Linguistic Module (CALM) to extract the Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. Thus, a certain number of the most appropriate reference speeches for the input text are selected by retrieving the speeches with the top STF similarities. The style embeddings are then combined in a weighted sum according to their STF similarities and used to stylize the synthesized speech of the TTS system. Experimental results demonstrate the effectiveness of our proposed approach, which outperforms a baseline approach with semantic-feature-based reference selection in both objective and subjective evaluations of the speaking styles of the synthesized speeches.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2859363787</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2859363787</sourcerecordid><originalsourceid>FETCH-proquest_journals_28593637873</originalsourceid><addsrcrecordid>eNqNi80KgkAURocgSMp3GGg9YDP5U7sQo0WudBsy5DU1c2zuGPr2KfQArT4O53wLYnEhdizYc74iNmLtOA73fO66wiK38HSNjzRUrdESTfUBGmqFyF4qlw1NOpDPqn3QxIwN0Fjl0MxYKE2jodOAOF9SGAwzik053EuajK0pASvckGUhGwT7t2uyPUdpeGGdVu8e0GS16nU7qYwH7kF4wg988V_1Bb1gQ44</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2859363787</pqid></control><display><type>article</type><title>CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis</title><source>Free E- Journals</source><creator>Meng, Yi ; Li, Xiang ; Wu, Zhiyong ; Li, Tingtian ; Sun, Zixun ; Xiao, Xinyu ; Sun, Chi ; Zhan, Hui ; Meng, Helen</creator><creatorcontrib>Meng, Yi ; Li, Xiang ; Wu, Zhiyong ; Li, Tingtian ; Sun, Zixun ; Xiao, Xinyu ; Sun, Chi ; Zhan, Hui ; Meng, Helen</creatorcontrib><description>To further improve the speaking styles of synthesized speeches, current text-to-speech (TTS) synthesis systems commonly employ reference speeches to stylize their outputs instead of just the input texts. These reference speeches are obtained by manual selection which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information, but also style irrelevant information. The information irrelevant to speaking style in the text could interfere the reference audio selection and result in improper speaking styles. To improve the reference selection, we propose Contrastive Acoustic-Linguistic Module (CALM) to extract the Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. 
Thus, a certain number of the most appropriate reference speeches for the input text are selected by retrieving the speeches with the top STF similarities. Then the style embeddings are weighted summarized according to their STF similarities and used to stylize the synthesized speech of TTS. Experiment results demonstrate the effectiveness of our proposed approach, with both objective evaluations and subjective evaluations on the speaking styles of the synthesized speeches outperform a baseline approach with semantic-feature-based reference selection.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Semantics ; Similarity ; Speaking ; Speech recognition ; Speeches ; Synthesis</subject><ispartof>arXiv.org, 2023-08</ispartof><rights>2023. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>778,782</link.rule.ids></links><search><creatorcontrib>Meng, Yi</creatorcontrib><creatorcontrib>Li, Xiang</creatorcontrib><creatorcontrib>Wu, Zhiyong</creatorcontrib><creatorcontrib>Li, Tingtian</creatorcontrib><creatorcontrib>Sun, Zixun</creatorcontrib><creatorcontrib>Xiao, Xinyu</creatorcontrib><creatorcontrib>Sun, Chi</creatorcontrib><creatorcontrib>Zhan, Hui</creatorcontrib><creatorcontrib>Meng, Helen</creatorcontrib><title>CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis</title><title>arXiv.org</title><description>To further improve the speaking styles of synthesized speeches, current text-to-speech 
(TTS) synthesis systems commonly employ reference speeches to stylize their outputs instead of just the input texts. These reference speeches are obtained by manual selection which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information, but also style irrelevant information. The information irrelevant to speaking style in the text could interfere the reference audio selection and result in improper speaking styles. To improve the reference selection, we propose Contrastive Acoustic-Linguistic Module (CALM) to extract the Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. Thus, a certain number of the most appropriate reference speeches for the input text are selected by retrieving the speeches with the top STF similarities. Then the style embeddings are weighted summarized according to their STF similarities and used to stylize the synthesized speech of TTS. 
Experiment results demonstrate the effectiveness of our proposed approach, with both objective evaluations and subjective evaluations on the speaking styles of the synthesized speeches outperform a baseline approach with semantic-feature-based reference selection.</description><subject>Semantics</subject><subject>Similarity</subject><subject>Speaking</subject><subject>Speech recognition</subject><subject>Speeches</subject><subject>Synthesis</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNi80KgkAURocgSMp3GGg9YDP5U7sQo0WudBsy5DU1c2zuGPr2KfQArT4O53wLYnEhdizYc74iNmLtOA73fO66wiK38HSNjzRUrdESTfUBGmqFyF4qlw1NOpDPqn3QxIwN0Fjl0MxYKE2jodOAOF9SGAwzik053EuajK0pASvckGUhGwT7t2uyPUdpeGGdVu8e0GS16nU7qYwH7kF4wg988V_1Bb1gQ44</recordid><startdate>20230830</startdate><enddate>20230830</enddate><creator>Meng, Yi</creator><creator>Li, Xiang</creator><creator>Wu, Zhiyong</creator><creator>Li, Tingtian</creator><creator>Sun, Zixun</creator><creator>Xiao, Xinyu</creator><creator>Sun, Chi</creator><creator>Zhan, Hui</creator><creator>Meng, Helen</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PTHSS</scope></search><sort><creationdate>20230830</creationdate><title>CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis</title><author>Meng, Yi ; Li, Xiang ; Wu, Zhiyong ; Li, Tingtian ; Sun, Zixun ; Xiao, Xinyu ; Sun, Chi ; Zhan, Hui ; Meng, 
Helen</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_28593637873</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Semantics</topic><topic>Similarity</topic><topic>Speaking</topic><topic>Speech recognition</topic><topic>Speeches</topic><topic>Synthesis</topic><toplevel>online_resources</toplevel><creatorcontrib>Meng, Yi</creatorcontrib><creatorcontrib>Li, Xiang</creatorcontrib><creatorcontrib>Wu, Zhiyong</creatorcontrib><creatorcontrib>Li, Tingtian</creatorcontrib><creatorcontrib>Sun, Zixun</creatorcontrib><creatorcontrib>Xiao, Xinyu</creatorcontrib><creatorcontrib>Sun, Chi</creatorcontrib><creatorcontrib>Zhan, Hui</creatorcontrib><creatorcontrib>Meng, Helen</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Meng, Yi</au><au>Li, Xiang</au><au>Wu, Zhiyong</au><au>Li, Tingtian</au><au>Sun, 
Zixun</au><au>Xiao, Xinyu</au><au>Sun, Chi</au><au>Zhan, Hui</au><au>Meng, Helen</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis</atitle><jtitle>arXiv.org</jtitle><date>2023-08-30</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>To further improve the speaking styles of synthesized speeches, current text-to-speech (TTS) synthesis systems commonly employ reference speeches to stylize their outputs instead of just the input texts. These reference speeches are obtained by manual selection which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information, but also style irrelevant information. The information irrelevant to speaking style in the text could interfere the reference audio selection and result in improper speaking styles. To improve the reference selection, we propose Contrastive Acoustic-Linguistic Module (CALM) to extract the Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. Thus, a certain number of the most appropriate reference speeches for the input text are selected by retrieving the speeches with the top STF similarities. Then the style embeddings are weighted summarized according to their STF similarities and used to stylize the synthesized speech of TTS. Experiment results demonstrate the effectiveness of our proposed approach, with both objective evaluations and subjective evaluations on the speaking styles of the synthesized speeches outperform a baseline approach with semantic-feature-based reference selection.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2023-08
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2859363787
source	Free E- Journals
subjects	Semantics ; Similarity ; Speaking ; Speech recognition ; Speeches ; Synthesis
title	CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T03%3A10%3A50IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=CALM:%20Contrastive%20Cross-modal%20Speaking%20Style%20Modeling%20for%20Expressive%20Text-to-Speech%20Synthesis&rft.jtitle=arXiv.org&rft.au=Meng,%20Yi&rft.date=2023-08-30&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2859363787%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2859363787&rft_id=info:pmid/&rfr_iscdi=true
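
The description field of this record outlines a retrieval step: reference speeches are chosen by their top STF similarities to the input text, and their style embeddings are then combined in a weighted sum. The following is a minimal sketch of that idea only, not the authors' implementation. The record does not say which similarity measure or weighting scheme is used, so cosine similarity and softmax weights are assumptions here, and the function names, the parameter `k`, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_and_fuse(query_stf, ref_stfs, ref_styles, k=2):
    """Pick the k references whose STF is most similar to the query STF and
    return (indices, weights, fused style embedding).

    Weights are a softmax over the top-k similarities; this weighting is an
    assumption, since the record only says "weighted ... according to their
    STF similarities".
    """
    sims = [cosine(query_stf, ref) for ref in ref_stfs]
    # Indices of the k most similar references, best first.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Numerically stable softmax over the selected similarities.
    peak = max(sims[i] for i in top)
    exps = [math.exp(sims[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the selected style embeddings, dimension by dimension.
    dim = len(ref_styles[0])
    fused = [
        sum(w * ref_styles[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return top, weights, fused


# Toy data: three reference speeches with 2-dimensional STFs and styles.
query_stf = [1.0, 0.0]
ref_stfs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_styles = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

top, weights, fused = select_and_fuse(query_stf, ref_stfs, ref_styles, k=2)
print(top, weights, fused)
```

In this toy setup the reference whose STF points the same way as the query receives the largest weight, so the fused style embedding leans toward that reference's style. In a real system the STFs and style embeddings would come from trained encoders rather than hand-written vectors.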