Sparse self-attention transformer for image inpainting

Bibliographic details
Published in: Pattern Recognition, 2024-01, Vol. 145, p. 109897, Article 109897
Main authors: Huang, Wenli; Deng, Ye; Hui, Siqi; Wu, Yang; Zhou, Sanping; Wang, Jinjun
Format: Article
Language: English
Description
Abstract: Learning-based image inpainting methods have made remarkable progress in recent years. Nevertheless, these methods still suffer from issues such as blurring, artifacts, and inconsistent contents. The use of vanilla convolution kernels, which have limited perceptual fields and spatially invariant kernel coefficients, is one of the main causes for these problems. In contrast, the multi-headed attention in the transformer can effectively model non-local relations among input features by generating adaptive attention scores. Therefore, this paper explores the feasibility of employing the transformer model for the image inpainting task. However, the multi-headed attention transformer blocks pose a significant challenge due to their overwhelming computational cost. To address this issue, we propose a novel U-Net style transformer-based network for the inpainting task, called the sparse self-attention transformer (Spa-former). The Spa-former retains the long-range modeling capacity of transformer blocks while reducing the computational burden. It incorporates a new channel attention approximation algorithm that reduces attention calculation to linear complexity. Additionally, it replaces the canonical softmax function with the ReLU function to generate a sparse attention map that effectively excludes irrelevant features. As a result, the Spa-former achieves effective long-range feature modeling with fewer parameters and lower computational resources. Our empirical results on challenging benchmarks demonstrate the superior performance of our proposed Spa-former over state-of-the-art approaches.
• To accommodate the long-range modeling capacity of transformer blocks while reducing the computational burden, we introduce a novel U-Net style transformer-based network, called sparse self-attention transformer (Spa-former), to approach the inpainting task.
• A transformer block to consider channel attention is adopted to model the global pixel relationships.
• We adopt the ReLU function as the activation function to obtain a sparse attention/feature map, where coefficients with low/no correlation are removed from the attention map.
• Experiments on challenging benchmarks demonstrate the superior performance of our Spa-former over state-of-the-art approaches.
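The abstract names two ideas: attention computed across channels rather than across pixels, and ReLU in place of softmax to zero out weakly related entries. The following is a minimal pure-Python sketch of those two ideas, not the paper's implementation. The record does not give the exact formulation, so everything here is an assumption for illustration: the function names (`channel_scores`, `relu_channel_attention`, and so on) are hypothetical, the query/key/value projections are omitted, and the `scale` factor is a placeholder. Features are stored as a C x N matrix (C channels, N = H*W pixels), so the attention map is C x C and the cost grows linearly with N.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def channel_scores(q, k, scale=1.0):
    """Channel-to-channel similarity: (C x N) @ (N x C) -> C x C.

    The map has C*C entries regardless of the number of pixels N,
    so the work is O(C^2 * N), i.e. linear in N.
    `scale` is an illustrative assumption, not a value from the paper.
    """
    return [[s * scale for s in row] for row in matmul(q, transpose(k))]


def relu_rows(scores):
    """ReLU activation: negative similarities become exactly zero."""
    return [[max(0.0, s) for s in row] for row in scores]


def softmax_rows(scores):
    """Row-wise softmax, shown only for contrast: every weight stays positive."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def relu_channel_attention(q, k, v, scale=1.0):
    """Return (output C x N, attention map C x C) using ReLU instead of softmax."""
    attn = relu_rows(channel_scores(q, k, scale))
    return matmul(attn, v), attn


if __name__ == "__main__":
    # Toy input: 3 channels, 4 pixels. Channel 2 is the negation of channel 0.
    q = [[1, 0, 1, 0], [0, 1, 0, 1], [-1, 0, -1, 0]]
    v = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    out, attn = relu_channel_attention(q, q, v)
    print("ReLU attention map:", attn)
    print("Softmax attention map:", softmax_rows(channel_scores(q, q)))
    print("Output:", out)
```

In this toy input the negatively correlated and uncorrelated channel pairs receive a weight of exactly zero under ReLU, while softmax assigns every pair a positive weight. That difference is the sparsity property the highlights describe.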
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2023.109897