Disentangling Structure and Appearance in ViT Feature Space

We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are “painted” with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer – “Splice”, which works by training a generator on a single and arbitrary pair of structure-appearance images, and “SpliceNet”, a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available in our project page: splice-vit.github.io.
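The splicing objective described in the abstract can be illustrated with a minimal numerical sketch. Following the paper's formulation, structure is represented by the self-similarity of deep ViT key features (one token per image patch) and appearance by the global [CLS] token; the random arrays below merely stand in for real ViT features, and names such as `splice_objective` and the weight `lambda_app` are illustrative, not the authors' code.

```python
import numpy as np

def self_similarity(keys):
    """Cosine self-similarity of ViT key features: an (n_patches, n_patches)
    matrix that captures spatial structure while discarding per-token appearance."""
    k = keys / np.linalg.norm(keys, axis=-1, keepdims=True)
    return k @ k.T

def splice_objective(gen_keys, gen_cls, struct_keys, app_cls, lambda_app=1.0):
    """Structure term: match the generated image's key self-similarity to the
    structure image's. Appearance term: match global [CLS] tokens to the
    appearance image's. The sum 'splices' the two representations."""
    structure_loss = np.mean((self_similarity(gen_keys)
                              - self_similarity(struct_keys)) ** 2)
    appearance_loss = np.mean((gen_cls - app_cls) ** 2)
    return structure_loss + lambda_app * appearance_loss

rng = np.random.default_rng(0)
n_patches, dim = 196, 384  # e.g. a 14x14 patch grid from a ViT-S/16 backbone
gen_keys = rng.standard_normal((n_patches, dim))
loss = splice_objective(gen_keys,
                        rng.standard_normal(dim),
                        rng.standard_normal((n_patches, dim)),
                        rng.standard_normal(dim))
print(f"splice loss on random stand-in features: {loss:.3f}")
```

In the actual frameworks this objective is minimized by a generator network (per image pair for Splice, feed-forward for SpliceNet) while the ViT stays frozen; the sketch only shows how the two disentangled terms combine.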

Bibliographic Details
Published in: ACM Transactions on Graphics, 2023-11, Vol. 43 (1), pp. 1-16, Article 11
Main authors: Tumanyan, Narek; Bar-Tal, Omer; Amir, Shir; Bagon, Shai; Dekel, Tali
Format: Article
Language: English
Online access: Full text
DOI: 10.1145/3630096
ISSN: 0730-0301
EISSN: 1557-7368
Source: ACM Digital Library
Subjects: Appearance and texture representations; Computing methodologies; Image processing; Image-based rendering; Shape representations