SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks.
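The task-dependent prompting described in the abstract can be pictured with a minimal sketch. The Python snippet below is an illustrative assumption only: the task-token names, token ids, and prompt layout are hypothetical and are not taken from the SpeechX paper or any released code. It shows how a single decoder-only codec language model could be conditioned on a task token, optional text tokens, and the codec tokens of the input audio.

```python
# Minimal sketch of task-dependent prompting for a neural codec language model.
# All token names and ids below are hypothetical assumptions for illustration.
from typing import List

# Hypothetical special-token ids that select the task for the shared model.
TASK_TOKEN_IDS = {
    "zero_shot_tts": 0,
    "noise_suppression": 1,
    "target_speaker_extraction": 2,
    "speech_removal": 3,
    "speech_editing": 4,
}

def build_prompt(task: str,
                 text_token_ids: List[int],
                 input_codec_token_ids: List[int]) -> List[int]:
    """Concatenate a task token, optional text tokens, and the codec tokens of
    the (possibly noisy) input audio into one conditioning sequence. A codec
    language model would then autoregressively generate output codec tokens,
    which a codec decoder turns back into a waveform."""
    return [TASK_TOKEN_IDS[task]] + text_token_ids + input_codec_token_ids

# Example: a noise-suppression prompt carries no text, only the codec tokens
# of the noisy input; a zero-shot TTS prompt would carry text tokens plus the
# codec tokens of an enrollment clip.
noisy_codec_ids = [101, 102, 103]  # placeholder codec token ids
prompt = build_prompt("noise_suppression", [], noisy_codec_ids)
print(prompt)  # [1, 101, 102, 103]
```

The appeal of this unified formulation, as the abstract notes, is extensibility: adding a new speech transformation task amounts to adding a new task token and training data, rather than building a specialized model.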


Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol. 32, p. 3355-3364
Main Authors: Wang, Xiaofei; Thakker, Manthan; Chen, Zhuo; Kanda, Naoyuki; Eskimez, Sefik Emre; Chen, Sanyuan; Tang, Min; Liu, Shujie; Li, Jinyu; Yoshioka, Takuya
Format: Article
Language: eng
Subjects:
Online Access: Order full text
container_end_page 3364
container_issue
container_start_page 3355
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 32
creator Wang, Xiaofei
Thakker, Manthan
Chen, Zhuo
Kanda, Naoyuki
Eskimez, Sefik Emre
Chen, Sanyuan
Tang, Min
Liu, Shujie
Li, Jinyu
Yoshioka, Takuya
description Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks.
doi_str_mv 10.1109/TASLP.2024.3419418
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2024, Vol.32, p.3355-3364
issn 2329-9290
2329-9304
language eng
recordid cdi_ieee_primary_10577150
source IEL
subjects Acoustics
audio-text input
Codecs
Codes
multi-task learning
Noise reduction
noise suppression
Speech coding
speech editing
Speech enhancement
Speech generation
speech removal
target speaker extraction
Task analysis
zero-shot text-to-speech
title SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T10%3A47%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SpeechX:%20Neural%20Codec%20Language%20Model%20as%20a%20Versatile%20Speech%20Transformer&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Wang,%20Xiaofei&rft.date=2024&rft.volume=32&rft.spage=3355&rft.epage=3364&rft.pages=3355-3364&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2024.3419418&rft_dat=%3Ccrossref_RIE%3E10_1109_TASLP_2024_3419418%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10577150&rfr_iscdi=true