UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge
We present UTDUSS, the UTokyo-SaruLab system submitted to the Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for various tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder.
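To make the challenge's notion of "discrete speech units" concrete, the following is a minimal, generic sketch of the quantization step inside a neural audio codec: continuous frame-level features are replaced by the IDs of their nearest codebook entries. It is an illustration only, with an assumed codebook size and feature dimension; it is not the UTDUSS system or any challenge baseline.

```python
# Illustrative sketch: nearest-codebook lookup, the step that turns continuous
# speech features into discrete unit IDs. All sizes below are assumptions.
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Return the index of the nearest codebook vector for each frame.

    features: (frames, dim) continuous frame-level representations
    codebook: (codebook_size, dim) learned code vectors
    """
    dists = torch.cdist(features, codebook)  # (frames, codebook_size) Euclidean distances
    return dists.argmin(dim=-1)              # (frames,) discrete unit IDs

# Toy usage with random tensors standing in for a codec encoder's outputs.
codebook = torch.randn(1024, 128)  # 1024 discrete units, 128-dimensional
frames = torch.randn(200, 128)     # 200 frames of "speech" features
unit_ids = quantize(frames, codebook)
print(unit_ids.shape)  # torch.Size([200])
```

In a real neural audio codec such as the one used for the Vocoder track, the codebook is learned jointly with an encoder and decoder so that the selected entries can be decoded back into a high-fidelity waveform.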
creator | Nakata, Wataru; Yamauchi, Kazuki; Yang, Dong; Hyodo, Hiroaki; Saito, Yuki |
description | We present UTDUSS, the UTokyo-SaruLab system submitted to the Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for various tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates a neural audio codec (NAC) pre-trained only on speech corpora, so that the learned codec represents the rich acoustic features needed for high-fidelity speech reconstruction. For the Acoustic+Vocoder track, we trained a Transformer encoder-decoder acoustic model that predicts the pre-trained NAC tokens from text input. We describe our strategies for building these models, including data selection, downsampling, and hyper-parameter tuning. Our system ranked second and first in the Vocoder and Acoustic+Vocoder tracks, respectively. |
doi_str_mv | 10.48550/arxiv.2403.13720 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2403.13720 |
language | eng |
recordid | cdi_arxiv_primary_2403_13720 |
source | arXiv.org |
subjects | Computer Science - Sound |
title | UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge |
url | https://arxiv.org/abs/2403.13720 |
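As a rough illustration of the Acoustic+Vocoder approach summarized in the description field above (a Transformer encoder-decoder that predicts pre-trained NAC tokens from text), the sketch below wires up such a model in PyTorch. The vocabulary sizes, dimensions, and module names are assumptions made for the example, not the authors' configuration; in the actual system, the predicted token sequence would be decoded into a waveform by the pre-trained NAC.

```python
# Illustrative sketch only: a text-to-codec-token acoustic model in the spirit
# of the description above. Hyper-parameters and vocabularies are assumptions.
import torch
import torch.nn as nn

class TextToCodecTokens(nn.Module):
    def __init__(self, n_text_symbols=256, n_codec_tokens=1024, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(n_text_symbols, d_model)   # text symbol IDs
        self.codec_emb = nn.Embedding(n_codec_tokens, d_model)  # previously emitted NAC token IDs
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, n_codec_tokens)  # logits over the codec-token vocabulary

    def forward(self, text_ids, codec_ids):
        # text_ids:  (batch, text_len)   encoder input
        # codec_ids: (batch, codec_len)  decoder input (teacher forcing / past tokens)
        src = self.text_emb(text_ids)
        tgt = self.codec_emb(codec_ids)
        # Causal mask so each position attends only to earlier codec tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(codec_ids.size(1))
        h = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(h)  # (batch, codec_len, n_codec_tokens)

# Toy usage with random inputs.
model = TextToCodecTokens()
text = torch.randint(0, 256, (2, 20))    # a batch of 2 "texts", 20 symbols each
codec = torch.randint(0, 1024, (2, 50))  # 50 teacher-forced codec tokens each
logits = model(text, codec)
print(logits.shape)  # torch.Size([2, 50, 1024])
```

Training such a model would minimize cross-entropy between these logits and the NAC tokens extracted from reference speech; the strategies the authors mention (data selection, downsampling, and hyper-parameter tuning) sit on top of a model of this kind.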