Stable Flow: Vital Layers for Training-Free Image Editing

Diffusion models have revolutionized the field of content synthesis and editing. Recent models have replaced the traditional UNet architecture with the Diffusion Transformer (DiT), and employed flow-matching for improved training and sampling. However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. The main challenge is that, unlike the UNet-based models, DiT lacks a coarse-to-fine synthesis structure, making it unclear in which layers to perform the injection. Therefore, we propose an automatic method to identify "vital layers" within DiT, crucial for image formation, and demonstrate how these layers facilitate a range of controlled stable edits, from non-rigid modifications to object addition, using the same mechanism. Next, to enable real-image editing, we introduce an improved image inversion method for flow models. Finally, we evaluate our approach through qualitative and quantitative comparisons, along with a user study, and demonstrate its effectiveness across multiple applications. The project page is available at https://omriavrahami.com/stable-flow
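
The record does not spell out how vital layers are identified, but the general recipe the abstract describes (measure how much each DiT layer matters to image formation) can be sketched as follows. This is a minimal illustration, not the authors' code: `model.blocks`, the `generate` sampling callback, and the single-tensor block interface are all assumptions, and pixel-space MSE stands in for a proper perceptual metric.

```python
import torch
import torch.nn.functional as F

def perceptual_distance(a, b):
    # Stand-in metric for illustration; a real pipeline would use a
    # learned perceptual measure rather than raw pixel-space MSE.
    return F.mse_loss(a, b).item()

@torch.no_grad()
def rank_vital_layers(model, prompt, generate, seed=0):
    """Rank DiT blocks by how much bypassing each one perturbs the output."""
    reference = generate(model, prompt, seed=seed)  # unmodified generation
    scores = {}
    for idx, block in enumerate(model.blocks):
        # Bypass the block: a forward hook that returns a value replaces
        # the module's output, so the block acts as an identity mapping.
        handle = block.register_forward_hook(lambda mod, inp, out: inp[0])
        ablated = generate(model, prompt, seed=seed)
        handle.remove()
        scores[idx] = perceptual_distance(reference, ablated)
    # Layers whose removal changes the image most are the "vital" ones.
    return sorted(scores, key=scores.get, reverse=True)
```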
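
Likewise, the "selective injection of attention features" can be illustrated as a two-pass scheme: cache the attention outputs of the vital layers while denoising on the source prompt, then replay them while sampling with the edit prompt. Again a hedged sketch under the same assumptions (`model.blocks[i].attn` is a hypothetical module path, and the actual method performs the injection within a batched generation rather than two passes):

```python
from collections import defaultdict, deque

import torch

@torch.no_grad()
def edit_with_injection(model, src_prompt, edit_prompt, vital_layers,
                        generate, seed=0):
    """Cache vital-layer attention outputs on the source prompt, then
    replay them while denoising with the edit prompt."""
    cached = defaultdict(deque)

    def save_hook(idx):
        def hook(mod, inp, out):
            cached[idx].append(out)        # one entry per denoising step
        return hook

    def inject_hook(idx):
        def hook(mod, inp, out):
            return cached[idx].popleft()   # same step's reference features
        return hook

    # Pass 1: generate the reference and record vital-layer attention outputs.
    handles = [model.blocks[i].attn.register_forward_hook(save_hook(i))
               for i in vital_layers]
    _ = generate(model, src_prompt, seed=seed)
    for h in handles:
        h.remove()

    # Pass 2: generate with the edit prompt, overriding the same layers.
    # Assumes both passes use identical schedules, so the deques line up.
    handles = [model.blocks[i].attn.register_forward_hook(inject_hook(i))
               for i in vital_layers]
    edited = generate(model, edit_prompt, seed=seed)
    for h in handles:
        h.remove()
    return edited
```

Injecting only into the vital layers, rather than all of them, is what leaves the edit prompt room to steer the regions that should change while the rest of the image stays consistent.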

Bibliographic Details
Main authors: Avrahami, Omri; Patashnik, Or; Fried, Ohad; Nemchinov, Egor; Aberman, Kfir; Lischinski, Dani; Cohen-Or, Daniel
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Graphics; Computer Science - Learning
DOI: 10.48550/arxiv.2411.14430
Published: 2024-11-21 (arXiv)
License: CC BY-NC-ND 4.0
Online access: https://arxiv.org/abs/2411.14430