Learning Source Disentanglement in Neural Audio Codec

Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural co...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	arXiv.org 2024-09
Hauptverfasser:	Bie, Xiaoyu, Liu, Xubo, Gaël, Richard
Format:	Artikel
Sprache:	eng
Schlagworte:	Audio data Audio signals Background noise Codec Learning Separation Signal generation Signal quality Sound effects Sound generation
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Bie, Xiaoyu Liu, Xubo Gaël, Richard
description	Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3106537797</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3106537797</sourcerecordid><originalsourceid>FETCH-proquest_journals_31065377973</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mQw9UlNLMrLzEtXCM4vLUpOVXDJLE7NK0nMS89JzQUyFDLzFPxSS4sScxQcS1My8xWc81NSk3kYWNMSc4pTeaE0N4Oym2uIs4duQVF-YWlqcUl8FtC0PKBUvLGhgZmpsbm5pbkxcaoAceE0qw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3106537797</pqid></control><display><type>article</type><title>Learning Source Disentanglement in Neural Audio Codec</title><source>Free E- Journals</source><creator>Bie, Xiaoyu ; Liu, Xubo ; Gaël, Richard</creator><creatorcontrib>Bie, Xiaoyu ; Liu, Xubo ; Gaël, Richard</creatorcontrib><description>Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Audio data ; Audio signals ; Background noise ; Codec ; Learning ; Separation ; Signal generation ; Signal quality ; Sound effects ; Sound generation</subject><ispartof>arXiv.org, 2024-09</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Bie, Xiaoyu</creatorcontrib><creatorcontrib>Liu, Xubo</creatorcontrib><creatorcontrib>Gaël, Richard</creatorcontrib><title>Learning Source Disentanglement in Neural Audio Codec</title><title>arXiv.org</title><description>Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.</description><subject>Audio data</subject><subject>Audio signals</subject><subject>Background noise</subject><subject>Codec</subject><subject>Learning</subject><subject>Separation</subject><subject>Signal generation</subject><subject>Signal quality</subject><subject>Sound effects</subject><subject>Sound generation</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mQw9UlNLMrLzEtXCM4vLUpOVXDJLE7NK0nMS89JzQUyFDLzFPxSS4sScxQcS1My8xWc81NSk3kYWNMSc4pTeaE0N4Oym2uIs4duQVF-YWlqcUl8FtC0PKBUvLGhgZmpsbm5pbkxcaoAceE0qw</recordid><startdate>20240917</startdate><enddate>20240917</enddate><creator>Bie, Xiaoyu</creator><creator>Liu, Xubo</creator><creator>Gaël, Richard</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240917</creationdate><title>Learning Source Disentanglement in Neural Audio Codec</title><author>Bie, Xiaoyu ; Liu, Xubo ; Gaël, Richard</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31065377973</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Audio data</topic><topic>Audio signals</topic><topic>Background noise</topic><topic>Codec</topic><topic>Learning</topic><topic>Separation</topic><topic>Signal generation</topic><topic>Signal quality</topic><topic>Sound effects</topic><topic>Sound generation</topic><toplevel>online_resources</toplevel><creatorcontrib>Bie, Xiaoyu</creatorcontrib><creatorcontrib>Liu, Xubo</creatorcontrib><creatorcontrib>Gaël, Richard</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Bie, Xiaoyu</au><au>Liu, Xubo</au><au>Gaël, Richard</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Learning Source Disentanglement in Neural Audio Codec</atitle><jtitle>arXiv.org</jtitle><date>2024-09-17</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codec and providing potential finer control over the audio generation process.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-09
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3106537797
source	Free E- Journals
subjects	Audio data Audio signals Background noise Codec Learning Separation Signal generation Signal quality Sound effects Sound generation
title	Learning Source Disentanglement in Neural Audio Codec
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T15%3A23%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Learning%20Source%20Disentanglement%20in%20Neural%20Audio%20Codec&rft.jtitle=arXiv.org&rft.au=Bie,%20Xiaoyu&rft.date=2024-09-17&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3106537797%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3106537797&rft_id=info:pmid/&rfr_iscdi=true