LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Despite the success of generating high-quality images given any text prompts by diffusion-based generative models, prior works directly generate the entire images, but cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital...

Bibliographic Details
Main Authors: Huang, Runhui; Cai, Kaixin; Han, Jianhua; Liang, Xiaodan; Pei, Renjing; Lu, Guansong; Xu, Songcen; Zhang, Wei; Xu, Hang
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: Order full text
creator Huang, Runhui ; Cai, Kaixin ; Han, Jianhua ; Liang, Xiaodan ; Pei, Renjing ; Lu, Guansong ; Xu, Songcen ; Zhang, Wei ; Xu, Hang
description Despite the success of diffusion-based generative models at producing high-quality images from arbitrary text prompts, prior works generate the entire image directly and cannot provide object-wise manipulation capability. To support wider real-world applications such as professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore, in this paper we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and an associated mask layer for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that LayerDiff generates high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
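
The description above names two attention mechanisms: an inter-layer attention that lets the background, foreground, and mask layers exchange information, and a text-guided intra-layer attention that conditions each layer on its own layer-specific prompt. The sketch below is a minimal, hypothetical illustration of that idea only; the class name LayerCollaborativeBlock, the tensor shapes, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of a layer-collaborative attention block, assuming
# per-layer latent tokens and per-layer text prompt embeddings.
import torch
import torch.nn as nn

class LayerCollaborativeBlock(nn.Module):
    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        # Inter-layer attention: tokens at the same spatial position attend
        # to each other across the layer axis (background / foregrounds / masks).
        self.inter_layer_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Text-guided intra-layer attention: each layer's tokens cross-attend
        # to that layer's own prompt embedding.
        self.intra_layer_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, layer_prompts: torch.Tensor) -> torch.Tensor:
        # x:             (batch, n_layers, n_tokens, dim)      latent tokens per layer
        # layer_prompts: (batch, n_layers, n_words, text_dim)  per-layer text features
        b, n_layers, n_tokens, dim = x.shape

        # Inter-layer attention: regroup so the layer axis is the sequence axis.
        h = x.permute(0, 2, 1, 3).reshape(b * n_tokens, n_layers, dim)
        h = self.norm1(h)
        h, _ = self.inter_layer_attn(h, h, h)
        h = h.reshape(b, n_tokens, n_layers, dim).permute(0, 2, 1, 3)
        x = x + h

        # Intra-layer attention: each layer attends to its own prompt tokens.
        q = self.norm2(x).reshape(b * n_layers, n_tokens, dim)
        kv = layer_prompts.reshape(b * n_layers, -1, layer_prompts.shape[-1])
        h, _ = self.intra_layer_attn(q, kv, kv)
        x = x + h.reshape(b, n_layers, n_tokens, dim)
        return x

# Example: background + one foreground + its mask (3 layers) of 16x16 latent tokens.
block = LayerCollaborativeBlock(dim=320, text_dim=768)
latents = torch.randn(2, 3, 256, 320)
prompts = torch.randn(2, 3, 77, 768)
out = block(latents, prompts)  # -> (2, 3, 256, 320)

In a full diffusion backbone, such a block would presumably sit alongside the usual spatial self-attention within each layer; the paper's layer-specific prompt-enhanced module and self-mask guidance sampling are not covered by this sketch.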
doi_str_mv 10.48550/arxiv.2403.11929
format Article
identifier DOI: 10.48550/arxiv.2403.11929
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model