CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis

Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. While previous methods have already delved into enhancing context comprehension, context representation still lacks effective representation capabilities and context-sensitive discriminability. In this paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within this framework, we define an innovative pretext task specific to CSS that enables the model to perform self-supervised learning on unlabeled conversational datasets to boost the model's context understanding. Additionally, we introduce a sampling strategy for negative sample augmentation to enhance context vectors' discriminability. This is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different contrastive learning strategies and comprehensive experiments in comparison with prior CSS systems. Results demonstrate that the synthesized speech from our proposed method exhibits more contextually appropriate and sensitive prosody.
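
The abstract describes a contrastive pretext task and negative-sample augmentation but, being a catalog record, gives no implementation details. The snippet below is a minimal, hypothetical sketch of the general technique (a standard InfoNCE-style contrastive loss over dialogue-context vectors), not the authors' actual method; the names info_nce_loss, anchor, positive, negatives, and tau are illustrative assumptions.

    # Minimal InfoNCE-style contrastive loss sketch (illustrative only,
    # not CONCSS's released code). Assumes PyTorch is installed.
    import torch
    import torch.nn.functional as F

    def info_nce_loss(anchor, positive, negatives, tau=0.07):
        # anchor:    (B, D) context vector for the current utterance
        # positive:  (B, D) vector from the matching dialogue history
        # negatives: (B, K, D) augmented negatives, e.g. mismatched histories
        anchor = F.normalize(anchor, dim=-1)
        positive = F.normalize(positive, dim=-1)
        negatives = F.normalize(negatives, dim=-1)
        pos_sim = torch.sum(anchor * positive, dim=-1, keepdim=True) / tau  # (B, 1)
        neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives) / tau       # (B, K)
        logits = torch.cat([pos_sim, neg_sim], dim=1)                       # (B, 1+K)
        # The positive sits at index 0, so a zero label pulls the anchor
        # toward the matching context and away from the negatives.
        labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
        return F.cross_entropy(logits, labels)

Under this reading, making the loss small forces context vectors to be discriminative: the anchor must score higher similarity with its true dialogue history than with any of the K augmented negatives.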

Bibliographic Details
Main Authors: Deng, Yayue; Xue, Jinlong; Jia, Yukang; Li, Qifei; Han, Yichen; Wang, Fengping; Gao, Yingming; Ke, Dengfeng; Li, Ya
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Human-Computer Interaction
Online Access: https://arxiv.org/abs/2312.10358
DOI: 10.48550/arxiv.2312.10358
Published: 2023-12-16
Source: arXiv.org