CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis
creator | Deng, Yayue; Xue, Jinlong; Jia, Yukang; Li, Qifei; Han, Yichen; Wang, Fengping; Gao, Yingming; Ke, Dengfeng; Li, Ya |
description | Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech with dialogue-appropriate prosody. While previous methods have explored enhancing context comprehension, context representations still lack expressive power and context-sensitive discriminability. In this paper, we introduce CONCSS, a contrastive-learning-based CSS framework. Within this framework, we define a pretext task specific to CSS that enables the model to learn in a self-supervised manner from unlabeled conversational datasets, boosting its context understanding. Additionally, we introduce a sampling strategy for negative-sample augmentation that enhances the discriminability of context vectors. To our knowledge, this is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different contrastive learning strategies and comprehensive experiments against prior CSS systems. Results demonstrate that speech synthesized by our method exhibits more contextually appropriate and sensitive prosody. |
doi | 10.48550/arxiv.2312.10358 |
format | Article |
creationdate | 2023-12-16 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
language | eng |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Human-Computer Interaction |
title | CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis |
url | https://arxiv.org/abs/2312.10358 |