Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems
While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
Saved in:
Published in: | arXiv.org 2024-09 |
---|---|
Main authors: | Liu, Jeongmin; Song, Eunwoo |
Format: | Article |
Language: | English |
Subjects: | Smoothing; Speech recognition; Vocoders; Waveforms |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Liu, Jeongmin; Song, Eunwoo |
description | While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively. |
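The augmentation described in the abstract can be illustrated with a minimal sketch: with some probability, a random-width linear (moving-average) smoothing filter is applied along the time axis of the input acoustic features before the vocoder sees them. The function name, probability, and filter-width range below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def smooth_features(mel, p=0.5, max_width=5, rng=None):
    """Randomly apply a linear (moving-average) smoothing filter along the
    time axis of a mel-spectrogram with shape (n_frames, n_mels).

    With probability 1 - p the features pass through unchanged, so the
    vocoder still trains on sharp ground-truth features part of the time.
    """
    rng = rng or np.random.default_rng()
    if rng.random() > p:
        return mel
    # Pick a random filter width; width >= 2 so the filter actually smooths.
    width = int(rng.integers(2, max_width + 1))
    kernel = np.ones(width) / width
    # Convolve each mel channel independently along time; "same" keeps length.
    return np.stack(
        [np.convolve(mel[:, c], kernel, mode="same") for c in range(mel.shape[1])],
        axis=1,
    )
```

Training a vocoder on a mixture of smoothed and unsmoothed features in this way exposes it to the kind of over-smoothed inputs an acoustic model produces at inference, which is the training-inference mismatch the paper targets.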
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-09 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3100999208 |
source | Free E-Journals |
subjects | Smoothing; Speech recognition; Vocoders; Waveforms |
title | Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-19T12%3A48%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Training%20Universal%20Vocoders%20with%20Feature%20Smoothing-Based%20Augmentation%20Methods%20for%20High-Quality%20TTS%20Systems&rft.jtitle=arXiv.org&rft.au=Liu,%20Jeongmin&rft.date=2024-09-04&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3100999208%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3100999208&rft_id=info:pmid/&rfr_iscdi=true |