An Efficient Transformer Based on Global and Local Self-attention for Face Photo-Sketch Synthesis


Full description

Saved in:
Bibliographic details
Published in: IEEE Transactions on Image Processing 2023-01, Vol. PP, p. 1-1
Main authors: Yu, Wangbo; Zhu, Mingrui; Wang, Nannan; Wang, Xiaoyu; Gao, Xinbo
Format: Article
Language: English
Subjects:
Online access: Order full text
Description
Summary: Face photo-sketch synthesis tasks have been dominated by convolutional neural networks (CNNs), especially CNN-based generative adversarial networks (GANs), because of their strong texture modeling capabilities and thus their ability to generate more realistic face photos/sketches than traditional methods. However, due to CNNs' locality and spatial-invariance properties, they are weak at capturing the global and structural information that is extremely important for face images. Inspired by the recent phenomenal success of the Transformer in vision tasks, we propose replacing CNNs with Transformers, which are able to model long-range dependencies, to synthesize more structured and realistic face images. However, existing vision Transformers are mainly designed for high-level vision tasks and lack the dense prediction ability needed to generate high-resolution images, due to the quadratic computational complexity of their self-attention mechanism. In addition, the original Transformer cannot model local correlations, which is important for image generation. To address these challenges, we propose two types of memory-friendly Transformer encoders: one for processing local correlations via local self-attention, and another for modeling global information via global self-attention. By integrating the two proposed Transformer encoders, we present an efficient GL-Transformer for face photo-sketch synthesis, which can synthesize realistic face photo/sketch images from coarse to fine. Extensive experiments demonstrate that our model achieves performance comparable to or better than state-of-the-art CNN-based methods, both qualitatively and quantitatively.
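The abstract contrasts global self-attention (every token attends to all others, quadratic cost) with local self-attention (attention restricted to a window, roughly linear cost). The sketch below is a generic NumPy illustration of that distinction, not the paper's actual encoders; the non-overlapping window partition and the absence of learned projections are simplifying assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Global self-attention over all N tokens: O(N^2) score matrix.

    For clarity, x serves as queries, keys, and values (no learned
    projection matrices, which a real Transformer encoder would have).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)   # (N, N) pairwise attention scores
    return softmax(scores) @ x      # weighted sum of all tokens

def local_self_attention(x, window=4):
    """Window-restricted self-attention: each token attends only to
    tokens in its own window, so the cost drops to O(N * window)."""
    n = x.shape[0]
    out = np.empty_like(x)
    for start in range(0, n, window):
        out[start:start + window] = self_attention(x[start:start + window])
    return out

tokens = np.random.default_rng(0).normal(size=(16, 8))  # 16 tokens, dim 8
g = self_attention(tokens)        # global: each token sees all 16 tokens
l = local_self_attention(tokens)  # local: each token sees only its 4-window
```

For a 256x256 image tokenized into patches, the global variant's score matrix grows with the square of the patch count, which is exactly the memory pressure the abstract cites as the obstacle to dense, high-resolution prediction; windowing caps that cost at the window size.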
ISSN: 1057-7149, 1941-0042
DOI: 10.1109/TIP.2022.3229614