T‐Skeleton: Accurate scene text detection via instance‐aware skeleton embedding


Bibliographic details
Published in: IET Image Processing 2024-05, Vol.18 (6), p.1491-1503
Main authors: Li, Haiyan; Hu, Xingfei; Lu, Hongtao
Format: Article
Language: English
Online access: Full text
Description
Summary: Existing segmentation‐based methods have made considerable progress in arbitrarily shaped text detection because they handle shape variation well. However, accurately detecting text instances remains challenging in the presence of dense layouts, inaccurate annotations, and complex backgrounds. Many recent works have focused on improving arbitrary boundary prediction, but it can still be difficult to distinguish each instance in a dense layout: under inaccurate annotations and complex backgrounds, boundary pixels may be misclassified, producing inaccurate results (i.e., adhesive texts). Considering both local and long‐range dependencies, this paper proposes an efficient text detector, T‐Skeleton, to obtain more reliable segmentation‐based detections. In the spirit of object skeletonization, we introduce the text instance skeleton, which highlights the semantically significant structure (similar to the skeleton of a fish), to explicitly capture the long‐range dependencies of text instances. The key idea of T‐Skeleton is to calibrate the coarse text proposals by embedding text instance skeletons so that crowded texts are separated accurately and robustly. We further design a channel attention module to enlarge the performance margin between T‐Skeleton and the segmentation baseline. Experimental results on four publicly available datasets show the superiority of T‐Skeleton in handling long and curved texts. Text detection that relies only on local context cannot separate crowded texts accurately and robustly, because of the challenges posed by dense text layouts, inaccurate annotations, and complex backgrounds. T‐Skeleton explicitly exploits long‐range dependencies to address these challenges by extracting the text skeleton, which is the distance transform of text pixels with respect to their nearest boundary pixels. T‐Skeleton has strong representation capacity and distinguishes long and curved texts well.
ISSN: 1751-9659, 1751-9667
DOI: 10.1049/ipr2.13043
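
The summary describes the text skeleton as a distance transform between text pixels and their nearest boundary pixels. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes a binary text mask, approximates "nearest boundary pixel" by the nearest non-text pixel, and uses a brute-force NumPy computation that is only practical for tiny masks. The names `distance_map`, `mask`, `dist`, and `core`, and the threshold `1.5`, are hypothetical choices made for illustration.

```python
import numpy as np


def distance_map(mask: np.ndarray) -> np.ndarray:
    """Euclidean distance from each text pixel to the nearest non-text pixel.

    Brute force over all pixel pairs, so suitable for small examples only.
    """
    mask = mask.astype(bool)
    dist = np.zeros(mask.shape, dtype=float)
    fg = np.argwhere(mask)    # coordinates of text pixels, row-major order
    bg = np.argwhere(~mask)   # coordinates of non-text pixels
    if len(fg) == 0 or len(bg) == 0:
        return dist
    # Pairwise coordinate differences, shape (n_fg, n_bg, 2).
    diff = fg[:, None, :] - bg[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    # Boolean indexing also assigns in row-major order, matching `fg`.
    dist[mask] = nearest
    return dist


# Toy "adhesive text" case: two horizontal text lines that a coarse
# segmentation has merged through a one-pixel bridge at (4, 4).
mask = np.zeros((9, 9), dtype=bool)
mask[1:4, 1:8] = True   # upper text line
mask[5:8, 1:8] = True   # lower text line
mask[4, 4] = True       # spurious bridge joining the two lines

dist = distance_map(mask)

# Keep only pixels far from the boundary (hypothetical threshold).
# The ridge of each line survives; the thin bridge does not.
core = dist >= 1.5
```

In this toy case the thresholded map keeps one ridge per text line and drops the bridge, which is the intuition behind using a skeleton-like representation to separate crowded text. For real images, `scipy.ndimage.distance_transform_edt` computes the same kind of map efficiently.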