Speaker Adaptive Text-to-Speech with Timbre-Normalized Vector-Quantized Feature
Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only consider adapting to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages a timbre-normalized vector-quantized (TN-VQ) acoustic feature for speaker adaptation with little data.
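The abstract (quoted in full in the description field below) outlines a three-step feature pipeline: PCA dimensionality reduction of self-supervised VQ acoustic features, timbre normalization, and k-means quantization. A minimal sketch of that pipeline follows; the paper's normalization step is a normalizing flow, for which a simple standardization stands in here purely for illustration, and all feature shapes and codebook sizes are assumptions.

```python
# Hedged sketch of the TN-VQ feature pipeline: PCA -> timbre
# normalization -> k-means quantization. The paper uses a normalizing
# flow for the middle step; the standardization below is only a
# stand-in, and all shapes/sizes are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def timbre_normalize(feats, mean, std):
    # Stand-in for the paper's flow model: remove gross per-speaker
    # statistics so frames from different speakers share one space.
    return (feats - mean) / (std + 1e-8)

# Offline: fit PCA and k-means on pooled training features.
train_feats = np.random.randn(10_000, 256)       # toy SSL features (frames, dim)
pca = PCA(n_components=64).fit(train_feats)
red_train = pca.transform(train_feats)
norm_train = timbre_normalize(red_train, red_train.mean(0), red_train.std(0))
kmeans = KMeans(n_clusters=512, n_init=10).fit(norm_train)

# Per utterance: reduce, normalize, quantize to discrete TN-VQ indices.
utt = np.random.randn(300, 256)                  # (frames, dim)
red = pca.transform(utt)
norm = timbre_normalize(red, red.mean(0), red.std(0))
tn_vq_codes = kmeans.predict(norm)               # sequence of codebook ids
```

The discrete code sequence would then drive the acoustic model, while timbre is reintroduced by the vocoder; that split is the point of the decomposition described in the abstract.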
Saved in:
Published in: | IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12 |
---|---|
Main authors: | Du, Chenpeng; Guo, Yiwei; Chen, Xie; Yu, Kai |
Format: | Article |
Language: | English |
Subjects: | Acoustics; Adaptation; Adaptation models; Decomposition; Embedding; Feature extraction; Lookup tables; Similarity; speaker adaptation; Speaking; Speech processing; Speech recognition; speech synthesis; Timbre; timbre normalization; Training; vector quantization; Vocoders |
Online access: | Order full text |
container_end_page | 12 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE/ACM transactions on audio, speech, and language processing |
container_volume | 31 |
creator | Du, Chenpeng; Guo, Yiwei; Chen, Xie; Yu, Kai |
description | Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with a limited amount of data is a challenging task. Most existing methods only consider adapting to the timbre of the target speakers but fail to capture their speaking styles from little data. In this work, we propose a novel TTS system, TN-VQTTS, which leverages a timbre-normalized vector-quantized (TN-VQ) acoustic feature for speaker adaptation with little data. With the TN-VQ feature, speaking style and timbre can be effectively decomposed and controlled separately by the acoustic model and the vocoder of VQTTS. This decomposition enables us to finely mimic both characteristics of the target speaker in adaptation with little data. Specifically, we first reduce the dimensionality of the self-supervised VQ acoustic feature via PCA and normalize its timbre with a normalizing flow model. The feature is then quantized with k-means and used as the TN-VQ feature for a multi-speaker VQTTS system. Furthermore, we optimize timbre-independent style embeddings of the training speakers jointly with the acoustic model and store them in a lookup table. The embedding table later serves as a selectable codebook or a set of basis vectors for representing the styles of unseen speakers. Our experiments on the LibriTTS dataset first show that the proposed model architecture for the VQ feature achieves better performance in multi-speaker text-to-speech synthesis than several existing methods. We also find that the reconstruction performance and the naturalness are almost unchanged after applying timbre normalization and k-means quantization. Finally, we show that TN-VQTTS achieves better speaker similarity in adaptation than both a speaker-embedding-based adaptation method and the fine-tuning-based baseline AdaSpeech. |
doi_str_mv | 10.1109/TASLP.2023.3308374 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2329-9290 |
ispartof | IEEE/ACM transactions on audio, speech, and language processing, 2023-01, Vol.31, p.1-12 |
issn | 2329-9290; 2329-9304 |
language | eng |
recordid | cdi_ieee_primary_10229489 |
source | IEEE/IET Electronic Library (IEL) |
subjects | Acoustics; Adaptation; Adaptation models; Decomposition; Embedding; Feature extraction; Lookup tables; Similarity; speaker adaptation; Speaking; Speech processing; Speech recognition; speech synthesis; Timbre; timbre normalization; Training; vector quantization; Vocoders |
title | Speaker Adaptive Text-to-Speech with Timbre-Normalized Vector-Quantized Feature |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T16%3A10%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Speaker%20Adaptive%20Text-to-Speech%20with%20Timbre-Normalized%20Vector-Quantized%20Feature&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Du,%20Chenpeng&rft.date=2023-01-01&rft.volume=31&rft.spage=1&rft.epage=12&rft.pages=1-12&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3308374&rft_dat=%3Cproquest_RIE%3E2881501344%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2881501344&rft_id=info:pmid/&rft_ieee_id=10229489&rfr_iscdi=true |
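The abstract above notes that the learned style-embedding table "serves as a selectable codebook or a group of basis" for unseen speakers. Below is a hedged sketch of both readings, assuming a toy table of training-speaker style embeddings; the weights in the basis reading would in practice be optimized on the target speaker's few utterances, and the paper's exact adaptation procedure may differ.

```python
# Toy illustration of using a trained style-embedding lookup table for
# unseen speakers: either select the nearest training style (codebook
# reading) or combine table rows with learned weights (basis reading).
# Table size and dimensions are assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
style_table = rng.standard_normal((200, 128))    # (train speakers, style dim)

def basis_style(logits, table):
    # Softmax-weighted mixture of training-speaker styles; `logits`
    # would be the only parameters tuned for the unseen speaker.
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ table                  # (style dim,)

def codebook_style(query, table):
    # Pick the single closest training-speaker style embedding.
    return table[np.linalg.norm(table - query, axis=1).argmin()]

adapted = basis_style(rng.standard_normal(200), style_table)
```

Because the table is timbre-independent by construction, either reading supplies only the style component; timbre for the new speaker is handled separately by the vocoder, matching the decomposition the abstract describes.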