SwinStyleformer is a favorable choice for image inversion
Format: Article
Language: English
Abstract: This paper proposes the first pure-Transformer inversion network,
SwinStyleformer, which compensates for the shortcomings of CNN inversion
frameworks by handling long-range dependencies and learning the global
structure of objects. Experiments show, however, that an inversion network
with a plain Transformer backbone fails to invert the image successfully.
This failure arises from differences between CNNs and Transformers: compared
to convolution, self-attention weights favor image structure while ignoring
image detail; the Transformer lacks multi-scale properties; and the latent
codes it extracts are distributed differently from StyleGAN style vectors.
To address these differences, we employ a Swin Transformer with a smaller
window size as the backbone of SwinStyleformer to enhance the local detail
of the inverted image. Meanwhile, we design a Transformer block based on
learnable queries (see the sketch after this record). Compared to a
self-attention block, the learnable-query block provides greater adaptability
and flexibility, enabling the model to update its attention weights according
to the task, so the inversion focus is not limited to image structure. To
further introduce multi-scale properties, we design multi-scale connections
in the extraction of feature maps (see the fusion sketch below). Multi-scale
connections give the model a comprehensive understanding of the image and
avoid the loss of detail caused by global modeling. Moreover, we propose an
inversion discriminator and a distribution alignment loss (see the loss
sketch below) to minimize the distribution differences. With these designs,
SwinStyleformer solves the Transformer's inversion failure and achieves
state-of-the-art performance in image inversion and several related vision
tasks.
DOI: 10.48550/arxiv.2406.13153
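
The abstract describes a Transformer block whose attention queries are learned
parameters rather than projections of the input. Below is a minimal PyTorch
sketch of that general idea, not the paper's actual implementation: the class
name LearnableQueryBlock and the settings num_queries=18, dim=512 are
illustrative assumptions (18 assumes one query per StyleGAN style vector).

```python
import torch
import torch.nn as nn

class LearnableQueryBlock(nn.Module):
    """Cross-attention block whose queries are learnable parameters
    instead of projections of the input (illustrative sketch)."""

    def __init__(self, num_queries=18, dim=512, num_heads=8):
        super().__init__()
        # Assumption: one learnable query per StyleGAN style vector.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens):
        # tokens: (B, N, dim) image features from the backbone.
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # queries attend to image tokens
        out = self.norm(out)
        return out + self.mlp(out)             # (B, num_queries, dim) latents


codes = LearnableQueryBlock()(torch.randn(2, 256, 512))  # -> (2, 18, 512)
```

Because the queries are free parameters updated by the task loss, the attention
pattern is not tied to the input's own structure, which is the adaptability the
abstract claims over plain self-attention.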
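The multi-scale connections combine feature maps from several backbone stages
so fine detail is not lost to global modeling. The sketch below shows one
plausible fusion scheme (1x1 projections plus resize-and-sum); the paper's
exact wiring, stage count, and channel widths are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Resize-and-sum fusion of feature maps from several backbone
    stages (one plausible form of 'multi-scale connections')."""

    def __init__(self, in_dims=(128, 256, 512), out_dim=512):
        super().__init__()
        # 1x1 convolutions project every stage to a common channel width.
        self.proj = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in in_dims)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps, fine to coarse resolution.
        target = feats[-1].shape[-2:]  # resize everything to the coarsest map
        return sum(
            F.interpolate(p(f), size=target, mode='bilinear',
                          align_corners=False)
            for p, f in zip(self.proj, feats))


fused = MultiScaleFusion()([torch.randn(1, 128, 64, 64),
                            torch.randn(1, 256, 32, 32),
                            torch.randn(1, 512, 16, 16)])  # -> (1, 512, 16, 16)
```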
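Finally, the distribution alignment loss pulls the encoder's latent codes
toward the distribution of genuine StyleGAN style vectors; the paper pairs it
with an inversion discriminator. One simple moment-matching form, offered
purely as an assumption about what such a loss could look like:

```python
import torch

def distribution_alignment_loss(pred_w, real_w):
    """Moment-matching penalty between inverted latents pred_w and style
    vectors real_w sampled from StyleGAN's mapping network. Both are
    (batch, dim); the paper's exact formulation may differ."""
    mean_term = (pred_w.mean(dim=0) - real_w.mean(dim=0)).pow(2).mean()
    std_term = (pred_w.std(dim=0) - real_w.std(dim=0)).pow(2).mean()
    return mean_term + std_term


loss = distribution_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```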