MorphFader: Enabling Fine-grained Controllable Morphing with Text-to-Audio Models

Sound morphing is the process of gradually and smoothly transforming one sound into another to generate novel and perceptually hybrid sounds that simultaneously resemble both. Recently, diffusion-based text-to-audio models have produced high-quality sounds using text prompts. However, granularly controlling the semantics of the sound, which is necessary for morphing, can be challenging using text. In this paper, we propose MorphFader, a controllable method for morphing sounds generated by disparate prompts using text-to-audio models. By intercepting and interpolating the components of the cross-attention layers within the diffusion process, we can create smooth morphs between sounds generated by different text prompts. Using both objective metrics and perceptual listening tests, we demonstrate the ability of our method to granularly control the semantics in the sound and generate smooth morphs.

Detailed Description

Saved in:
Bibliographic Details
Published in: arXiv.org, 2024-08
Main Authors: Kamath, Purnima; Gupta, Chitralekha; Nanayakkara, Suranga
Format: Article
Language: English
Subjects: Controllability; Diffusion layers; Morphing; Semantics
Online Access: Full text
description Sound morphing is the process of gradually and smoothly transforming one sound into another to generate novel and perceptually hybrid sounds that simultaneously resemble both. Recently, diffusion-based text-to-audio models have produced high-quality sounds using text prompts. However, granularly controlling the semantics of the sound, which is necessary for morphing, can be challenging using text. In this paper, we propose MorphFader, a controllable method for morphing sounds generated by disparate prompts using text-to-audio models. By intercepting and interpolating the components of the cross-attention layers within the diffusion process, we can create smooth morphs between sounds generated by different text prompts. Using both objective metrics and perceptual listening tests, we demonstrate the ability of our method to granularly control the semantics in the sound and generate smooth morphs.
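The description explains the core mechanism: intercepting the cross-attention layers of the diffusion process and interpolating their components between two text prompts. As a hedged illustration only (this is not the authors' implementation; the function names and the specific choice of linearly interpolating the text-derived keys and values are assumptions), a minimal NumPy sketch of that idea:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Standard scaled dot-product cross-attention.
    # q: (Tq, d) queries from the audio latent; k, v: (Tk, d) from a text prompt.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def morphed_cross_attention(q, k_a, v_a, k_b, v_b, alpha):
    # Blend the text-derived keys/values of two prompts before attending.
    # Assumes both prompts were encoded to the same sequence length Tk
    # (in practice one would pad or truncate). alpha=0 reproduces prompt A,
    # alpha=1 reproduces prompt B; intermediate values yield hybrids.
    k = (1.0 - alpha) * k_a + alpha * k_b
    v = (1.0 - alpha) * v_a + alpha * v_b
    return cross_attention(q, k, v)
```

At the endpoints the interpolation collapses to plain cross-attention against a single prompt, which is what makes a fader-style control (sweeping alpha from 0 to 1) produce a smooth morph rather than an abrupt switch.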
date 2024-08-14
identifier EISSN: 2331-8422
publisher Cornell University Library, arXiv.org
recordid cdi_proquest_journals_3093279854
source Free E-Journals
subjects Controllability
Diffusion layers
Morphing
Semantics
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T07%3A05%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=MorphFader:%20Enabling%20Fine-grained%20Controllable%20Morphing%20with%20Text-to-Audio%20Models&rft.jtitle=arXiv.org&rft.au=Kamath,%20Purnima&rft.date=2024-08-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3093279854%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3093279854&rft_id=info:pmid/&rfr_iscdi=true