SwinStyleformer is a favorable choice for image inversion

This paper proposes the first pure Transformer inversion network, called SwinStyleformer, which compensates for the shortcomings of CNN inversion frameworks by handling long-range dependencies and learning the global structure of objects. Experiments found that an inversion network with a plain Transformer backbone could not successfully invert the image. This failure arises from the differences between CNNs and Transformers: compared to convolution, self-attention weights favor image structure and ignore image details; the Transformer lacks multi-scale properties; and the distribution of the latent codes extracted by the Transformer differs from that of the StyleGAN style vectors. To address these differences, we employ a Swin Transformer with a smaller window size as the backbone of SwinStyleformer to enhance the local detail of the inverted image. Meanwhile, we design a Transformer block based on learnable queries. Compared to a self-attention block, a learnable-query block provides greater adaptability and flexibility, enabling the model to update its attention weights according to the specific task, so the inversion focus is not limited to image structure. To further introduce multi-scale properties, we design multi-scale connections in the extraction of feature maps; these connections give the model a comprehensive understanding of the image and avoid the loss of detail caused by purely global modeling. Moreover, we propose an inversion discriminator and a distribution alignment loss to minimize the distribution differences. With these designs, our SwinStyleformer solves the Transformer's inversion failure and demonstrates state-of-the-art performance in image inversion and several related vision tasks.
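
The learnable-query block described in the abstract can be read as a cross-attention layer whose queries are trained parameters rather than projections of the input tokens. The PyTorch sketch below illustrates that idea under stated assumptions; the class name, dimensions, and single-layer structure are hypothetical, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LearnableQueryBlock(nn.Module):
    """Minimal sketch of an attention block driven by learnable queries.

    Unlike self-attention, where queries are projected from the input,
    the queries here are free parameters updated by backpropagation, so
    the attention pattern can adapt to the inversion task rather than
    being tied to image structure alone.
    """

    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        # Task-specific queries (hypothetical shape: one set shared
        # across the batch), learned jointly with the rest of the model.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) tokens from the Swin backbone.
        q = self.queries.expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)   # queries attend to image tokens
        out = out + self.mlp(self.norm(out))  # feed-forward residual
        return out  # (B, num_queries, dim)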
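
In an inversion encoder, `num_queries` could plausibly be set to the number of StyleGAN style vectors (e.g. 18 for a 1024x1024 generator) so that each query produces one latent code. The abstract likewise does not give the form of the distribution alignment loss; a simple moment-matching stand-in, which penalizes differences in mean and variance between inverted latent codes and codes sampled from StyleGAN's mapping network, might look like the following (the function name and formulation are assumptions):

```python
import torch

def distribution_alignment_loss(pred_w: torch.Tensor,
                                real_w: torch.Tensor) -> torch.Tensor:
    """Hypothetical moment-matching alignment loss.

    pred_w: (B, dim) latent codes produced by the inversion encoder.
    real_w: (B, dim) latent codes sampled from StyleGAN's mapping network.
    """
    mean_term = (pred_w.mean(dim=0) - real_w.mean(dim=0)).pow(2).mean()
    var_term = (pred_w.var(dim=0) - real_w.var(dim=0)).pow(2).mean()
    return mean_term + var_term
```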

Bibliographic Details

Main Authors: Mao, Jiawei; Zhao, Guangyi; Yin, Xuesong; Chang, Yuanqi
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2406.13153
DOI: 10.48550/arxiv.2406.13153
Published: 2024-06-18
Source: arXiv.org