CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis

Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. While previous methods have already delved into enhancing context comprehension, context representation still lacks effective representation capabilities and context-sensitive discriminability. In this paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within this framework, we define an innovative pretext task specific to CSS that enables the model to perform self-supervised learning on unlabeled conversational datasets to boost the model's context understanding. Additionally, we introduce a sampling strategy for negative sample augmentation to enhance context vectors' discriminability. This is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different contrastive learning strategies and comprehensive experiments in comparison with prior CSS systems. Results demonstrate that the synthesized speech from our proposed method exhibits more contextually appropriate and sensitive prosody.
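
The abstract describes a contrastive pretext task and negative-sample augmentation but, being a catalog record, gives no implementation details. The snippet below is a minimal, hypothetical sketch of the general technique (a standard InfoNCE-style contrastive loss over dialogue-context vectors), not the authors' actual method; the names info_nce_loss, anchor, positive, negatives, and tau are illustrative assumptions.

    # Minimal InfoNCE-style contrastive loss sketch (illustrative only,
    # not CONCSS's released code). Assumes PyTorch is installed.
    import torch
    import torch.nn.functional as F

    def info_nce_loss(anchor, positive, negatives, tau=0.07):
        # anchor:    (B, D) context vector for the current utterance
        # positive:  (B, D) vector from the matching dialogue history
        # negatives: (B, K, D) augmented negatives, e.g. mismatched histories
        anchor = F.normalize(anchor, dim=-1)
        positive = F.normalize(positive, dim=-1)
        negatives = F.normalize(negatives, dim=-1)
        pos_sim = torch.sum(anchor * positive, dim=-1, keepdim=True) / tau  # (B, 1)
        neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives) / tau       # (B, K)
        logits = torch.cat([pos_sim, neg_sim], dim=1)                       # (B, 1+K)
        # The positive sits at index 0, so a zero label pulls the anchor
        # toward the matching context and away from the negatives.
        labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
        return F.cross_entropy(logits, labels)

Under this reading, making the loss small forces context vectors to be discriminative: the anchor must score higher similarity with its true dialogue history than with any of the K augmented negatives.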

Bibliographic Details
Main Authors: Deng, Yayue; Xue, Jinlong; Jia, Yukang; Li, Qifei; Han, Yichen; Wang, Fengping; Gao, Yingming; Ke, Dengfeng; Li, Ya
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Human-Computer Interaction
Online Access: https://arxiv.org/abs/2312.10358
DOI: 10.48550/arxiv.2312.10358
Published: 2023-12-16
Source: arXiv.org