UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

We present UTDUSS, the UTokyo-SaruLab system submitted to the Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for several tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates a neural audio codec (NAC) pre-trained only on speech corpora, which enables the learned codec to represent the rich acoustic features necessary for high-fidelity speech reconstruction. For the Acoustic+Vocoder track, we trained a Transformer encoder-decoder acoustic model that predicts the pre-trained NAC tokens from text input. We describe our strategies for building these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked second and first in the Vocoder and Acoustic+Vocoder tracks, respectively.
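As a concrete illustration of the Acoustic+Vocoder setup summarized above, the sketch below shows a Transformer encoder-decoder that predicts discrete NAC token IDs from text tokens. It is a minimal PyTorch example written for this summary only: the class name, vocabulary sizes, and layer counts are hypothetical placeholders and do not reflect the actual UTDUSS implementation.

```python
import torch
import torch.nn as nn

class TextToNACTokens(nn.Module):
    """Toy text-to-codec-token model: a Transformer encoder reads text
    tokens, and the decoder autoregressively predicts discrete neural
    audio codec (NAC) token IDs. All sizes are illustrative."""

    def __init__(self, text_vocab=256, nac_vocab=1024, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.nac_emb = nn.Embedding(nac_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.head = nn.Linear(d_model, nac_vocab)  # logits over the NAC codebook

    def forward(self, text_ids, nac_ids):
        # Causal mask so each codec token only attends to earlier ones.
        tgt_mask = self.transformer.generate_square_subsequent_mask(nac_ids.size(1))
        hidden = self.transformer(
            self.text_emb(text_ids),
            self.nac_emb(nac_ids),
            tgt_mask=tgt_mask,
        )
        return self.head(hidden)

# Example: a batch of 2 utterances, 10 text tokens -> 50 codec tokens.
model = TextToNACTokens()
text = torch.randint(0, 256, (2, 10))
codec = torch.randint(0, 1024, (2, 50))
print(model(text, codec).shape)  # torch.Size([2, 50, 1024])
```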

Detailed description

Bibliographic details
Main authors: Nakata, Wataru; Yamauchi, Kazuki; Yang, Dong; Hyodo, Hiroaki; Saito, Yuki
Format: Article
Language: English (eng)
Subjects: Computer Science - Sound
Published: 2024-03-20
DOI: 10.48550/arxiv.2403.13720
Source: arXiv.org
Rights: http://creativecommons.org/licenses/by/4.0
Online access: https://arxiv.org/abs/2403.13720