CTRNet++: Dual-path Learning with Local-global Context Modeling for Scene Text Removal
| Published in: | ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-10 |
|---|---|
| Main authors: | , , , |
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Full text |
| Summary: | Scene text removal has recently attracted growing research interest due to its applications in privacy protection, document restoration, and text editing. While deep learning and generative adversarial networks have enabled significant progress, existing methods often struggle to generate consistent and plausible textures when erasing text from complex backgrounds. To address this challenge, we propose a Contextual-guided Text Removal Network (CTRNet). CTRNet uses Low-level and High-level Contextual Guidance blocks (LCG, HCG) to extract both low-level structural and high-level discriminative context features from the input data, guiding the text erasure and background restoration process. We further extend CTRNet to CTRNet++ by incorporating an Auto-Encoder architecture as a novel and effective HCG block, which serves as an additional image inpainting branch and provides more accurate texture and context clues with the assistance of a large volume of natural images. We then introduce a Context Embedding and Content Feature Modeling (CECFM) block that combines Depth-wise CNN and Transformer layers to capture local features and establish long-range relationships among pixels globally. In addition, an efficient Progressive Feature Fusion Module (PFFM) is proposed to fully exploit multi-scale features from the different branches. Experiments on the benchmark datasets SCUT-EnsText and SCUT-Syn demonstrate that CTRNet++ significantly outperforms existing state-of-the-art methods and exhibits a stronger ability to reconstruct complex backgrounds. The code is available at https://github.com/lcy0604/CTRNet-plus. |
| ISSN: | 1551-6857, 1551-6865 |
| DOI: | 10.1145/3697837 |
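
To make the abstract's local-global modeling idea concrete, below is a minimal PyTorch sketch of a CECFM-style block: a depth-wise convolution branch for local texture plus a Transformer encoder layer for global relationships among pixels. The class name, layer composition, and all hyperparameters are illustrative assumptions drawn only from the abstract, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a CECFM-style block, assuming PyTorch.
# All names and sizes here are hypothetical, inferred from the abstract's
# description (depth-wise CNN for local features + Transformer for global
# context); they do not reproduce the CTRNet++ code.
import torch
import torch.nn as nn


class CECFMBlock(nn.Module):
    """Local-global context block: depth-wise conv branch + self-attention."""

    def __init__(self, channels: int = 256, num_heads: int = 8):
        super().__init__()
        # Depth-wise convolution (groups=channels) applies one 3x3 filter per
        # channel, capturing local texture around each pixel; the 1x1 conv
        # then mixes information across channels.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.GELU(),
        )
        # A standard Transformer encoder layer treats spatial positions as
        # tokens and models long-range dependencies among all pixels.
        self.global_ctx = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads,
            dim_feedforward=channels * 4, batch_first=True,
        )
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the encoder.
        b, c, h, w = x.shape
        x = x + self.local(x)                        # local residual branch
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C) tokens
        tokens = self.global_ctx(self.norm(tokens))  # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = CECFMBlock(channels=64, num_heads=4)
    feats = torch.randn(2, 64, 32, 32)
    print(block(feats).shape)  # torch.Size([2, 64, 32, 32])
```

In this reading, the depth-wise branch stays cheap because each channel is filtered independently, while the attention layer supplies the global context that purely convolutional erasure networks lack on complex backgrounds; how CTRNet++ actually interleaves the two is specified in the paper and repository, not here.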