An Efficient Latent Style Guided Transformer-CNN Framework for Face Super-Resolution

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2024-01, Vol. 26, p. 1-11
Main authors: Qi, Haoran; Qiu, Yuwei; Luo, Xing; Jin, Zhi
Format: Article
Language: English
Description
Abstract: In the Face Super-Resolution (FSR) task, it is important to precisely recover facial textures while maintaining facial contours in order to produce realistic high-resolution faces. Although several CNN-based FSR methods have achieved strong performance, they fail to restore facial contours because of the limitations of local convolutions. In contrast, Transformer-based methods, which use self-attention as the basic component, excel at modeling long-range dependencies between image patches. However, learning long-range dependencies often degrades facial textures because of the lack of locality. This raises a natural question: how can the strengths of CNNs and Transformers be combined effectively to reconstruct faces better? To address this, we propose an Efficient Latent Style guided Transformer-CNN framework for FSR, called ELSFace, which integrates the advantages of both CNNs and Transformers. The framework consists of a Feature Preparation Stage and a Feature Carving Stage. Basic facial contours and textures are generated in the Feature Preparation Stage and are separately guided by latent styles, so that facial details are better represented in the reconstruction. In the Feature Carving Stage, a CNN stream and a Transformer stream restore facial textures and facial contours, respectively, in a parallel recursive manner. Because high-frequency features tend to be neglected when learning long-range dependencies, we design a High-Frequency Enhancement Block (HFEB) for the Transformer stream. A Sharp Loss is also proposed to improve perceptual quality during optimization. Extensive experimental results demonstrate that ELSFace achieves the best results across all metrics compared with state-of-the-art CNN-based and Transformer-based methods on commonly used datasets and real-world tasks. ELSFace also has the fewest model parameters and the shortest running time. The code is released at https://github.com/FVL2020/ELSFace.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2023.3283856