SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks.
Saved in:
Published in: | IEEE/ACM transactions on audio, speech, and language processing, 2024, Vol.32, p.3355-3364 |
Main authors: | Wang, Xiaofei; Thakker, Manthan; Chen, Zhuo; Kanda, Naoyuki; Eskimez, Sefik Emre; Chen, Sanyuan; Tang, Min; Liu, Shujie; Li, Jinyu; Yoshioka, Takuya |
Format: | Article |
Language: | eng |
Online access: | Order full text |
container_end_page | 3364 |
container_issue | |
container_start_page | 3355 |
container_title | IEEE/ACM transactions on audio, speech, and language processing |
container_volume | 32 |
creator | Wang, Xiaofei; Thakker, Manthan; Chen, Zhuo; Kanda, Naoyuki; Eskimez, Sefik Emre; Chen, Sanyuan; Tang, Min; Liu, Shujie; Li, Jinyu; Yoshioka, Takuya |
description | Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. |
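The task-dependent prompting described in the abstract can be sketched as follows: a decoder-only codec language model receives one input sequence concatenating a task token, optional text tokens, and neural-codec acoustic tokens. All names, special tokens, and the sequence layout below are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of SpeechX-style task-dependent prompting.
# The model input is one flat sequence: [task token] [text tokens?] [audio tokens].

TASK_TOKENS = {
    "zero_shot_tts": "<tts>",
    "noise_suppression": "<ns>",
    "target_speaker_extraction": "<tse>",
    "speech_removal": "<sr>",
    "speech_editing": "<edit>",
}

def build_prompt(task, text_tokens, acoustic_tokens):
    """Concatenate a task token, optional text tokens, and codec
    acoustic tokens into one sequence for a decoder-only LM."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    prompt = [TASK_TOKENS[task]]
    if text_tokens:  # tasks like noise suppression may have no text input
        prompt += ["<text>"] + text_tokens + ["</text>"]
    prompt += ["<audio>"] + acoustic_tokens
    return prompt

# Noise suppression: no text, only noisy codec tokens (dummy ids here).
seq = build_prompt("noise_suppression", None, [101, 102, 103])
# seq == ["<ns>", "<audio>", 101, 102, 103]
```

Because every task shares this single input format, adding a new transformation task amounts to defining a new task token and training data, which is what makes the multi-task design extensible.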
doi_str_mv | 10.1109/TASLP.2024.3419418 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2329-9290 |
ispartof | IEEE/ACM transactions on audio, speech, and language processing, 2024, Vol.32, p.3355-3364 |
issn | 2329-9290 2329-9304 |
language | eng |
recordid | cdi_ieee_primary_10577150 |
source | IEL |
subjects | Acoustics; audio-text input; Codecs; Codes; multi-task learning; Noise reduction; noise suppression; Speech coding; speech editing; Speech enhancement; Speech generation; speech removal; target speaker extraction; Task analysis; zero-shot text-to-speech |
title | SpeechX: Neural Codec Language Model as a Versatile Speech Transformer |