Disentangling Structure and Appearance in ViT Feature Space
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
Saved in:
Published in: | arXiv.org 2023-11 |
---|---|
Main authors: | Tumanyan, Narek; Bar-Tal, Omer; Shir Amir; Bagon, Shai; Dekel, Tali |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Tumanyan, Narek; Bar-Tal, Omer; Shir Amir; Bagon, Shai; Dekel, Tali |
description | We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically, we derive novel disentangled representations of structure and appearance extracted from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them together in the space of ViT features. Based on our objective function, we propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks do not involve adversarial training, nor do they require any additional input information such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available in our project page: splice-vit.github.io. |
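The splicing objective the description mentions can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the specific choices here (cosine self-similarity of deep ViT key features as the structure descriptor, a global [CLS]-style token as the appearance descriptor, and the weight `lam`) are assumptions, and random arrays stand in for features that would really be extracted from a pre-trained, frozen ViT.

```python
import numpy as np

def self_similarity(keys):
    # keys: (T, D) token features from a deep ViT layer.
    # Structure descriptor: cosine self-similarity matrix (T, T),
    # which keeps spatial layout while discarding appearance.
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return k @ k.T

def splice_objective(gen_keys, gen_cls, struct_keys, app_cls, lam=0.1):
    # Appearance term: match the generated image's global [CLS]-style
    # token to the appearance image's token.
    l_app = np.sum((gen_cls - app_cls) ** 2)
    # Structure term: match token self-similarities to those of the
    # structure image, interweaving the two in ViT feature space.
    l_struct = np.sum((self_similarity(gen_keys)
                       - self_similarity(struct_keys)) ** 2)
    return l_app + lam * l_struct

rng = np.random.default_rng(0)
T, D = 16, 8  # toy token count / feature dim standing in for real ViT features
keys = rng.standard_normal((T, D))
cls = rng.standard_normal(D)
# The objective vanishes when structure and appearance both match exactly.
print(splice_objective(keys, cls, keys, cls))  # -> 0.0
```

In the training setup the description outlines, a generator would be optimized so that its output's structure descriptor matches the structure image and its appearance descriptor matches the appearance image, with no adversarial loss and no segmentation or correspondence supervision.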
doi_str_mv | 10.48550/arxiv.2311.12193 |
format | Article |
fullrecord | Article: Tumanyan, Narek; Bar-Tal, Omer; Shir Amir; Bagon, Shai; Dekel, Tali. "Disentangling Structure and Appearance in ViT Feature Space." arXiv.org, 2023-11-20. EISSN: 2331-8422. DOI: 10.48550/arxiv.2311.12193. Published version DOI: 10.1145/3630096. Publisher: Cornell University Library, Ithaca. License: http://creativecommons.org/licenses/by/4.0 |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_arxiv_primary_2311_12193 |
source | arXiv.org; Free E-Journals |
subjects | Computer Science - Computer Vision and Pattern Recognition; Image segmentation; Representations; Semantic segmentation; Semantics; Training |
title | Disentangling Structure and Appearance in ViT Feature Space |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T13%3A06%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Disentangling%20Structure%20and%20Appearance%20in%20ViT%20Feature%20Space&rft.jtitle=arXiv.org&rft.au=Tumanyan,%20Narek&rft.date=2023-11-20&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2311.12193&rft_dat=%3Cproquest_arxiv%3E2892413933%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2892413933&rft_id=info:pmid/&rfr_iscdi=true |