Prompt-to-Prompt Image Editing with Cross Attention Control

Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable ability to generate highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to verbally describing their intent. It is therefore natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, because an innate property of an editing technique is to preserve most of the original image, whereas in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring users to provide a spatial mask that localizes the edit, thereby ignoring the original structure and content within the masked region.

In this paper, we pursue an intuitive prompt-to-prompt editing framework in which edits are controlled by text alone. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. Building on this observation, we present several applications that control the synthesized image by editing only the textual prompt: localized editing by replacing a word, global editing by adding a specification, and even fine-grained control over the extent to which a word is reflected in the image. We present results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
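
To make the mechanism concrete, the following is a minimal NumPy sketch of cross-attention control, not the authors' implementation; all names, shapes, and the toy data are illustrative assumptions. Image-feature queries attend to prompt-token keys and values; a word swap is realized by injecting the attention maps cached from the source prompt into the pass over the edited prompt (preserving the spatial layout), and re-weighting scales the attention column of a chosen token.

    import numpy as np

    # Hypothetical single-head cross-attention layer with prompt-to-prompt hooks.
    # q: (n_pixels, d) image-feature queries; k, v: (n_tokens, d) text keys/values.
    def cross_attention(q, k, v, injected_attn=None, token_weights=None):
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)                 # (n_pixels, n_tokens)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)      # softmax over prompt tokens
        if injected_attn is not None:
            attn = injected_attn                      # word swap: reuse source maps
        if token_weights is not None:
            attn = attn * token_weights[None, :]      # re-weighting: scale a token's influence
        return attn @ v, attn

    # Toy usage: cache the maps from the source prompt, then inject them while
    # attending to the edited prompt's embeddings (e.g. "cat" swapped for "dog").
    rng = np.random.default_rng(0)
    q = rng.normal(size=(64, 8))                      # 8x8 spatial grid, flattened
    k_src, v_src = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
    _, src_attn = cross_attention(q, k_src, v_src)    # source pass: cache attention
    k_edit, v_edit = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
    edited, _ = cross_attention(q, k_edit, v_edit, injected_attn=src_attn)

In a real diffusion model this substitution would run inside the denoising loop, at every cross-attention layer and for a chosen fraction of the timesteps; the sketch only isolates the injection and re-weighting operations themselves.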

Bibliographic Details

Main authors: Hertz, Amir; Mokady, Ron; Tenenbaum, Jay; Aberman, Kfir; Pritch, Yael; Cohen-Or, Daniel
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Graphics; Computer Science - Learning
DOI: 10.48550/arxiv.2208.01626
Published: 2022-08-02
Source: arXiv.org
Online access: https://arxiv.org/abs/2208.01626