LineSeg: line segmentation of scanned newspaper documents

Bibliographic details
Published in: Pattern analysis and applications : PAA, 2022-02, Vol. 25 (1), p. 189-208
Main authors: Kaur, Rupinder Pal; Jindal, M. K.; Kumar, Munish; Jindal, Simpel Rani; Tuteja, Shikha
Format: Article
Language: English
Online access: Full text
Description
Abstract: Segmentation is a significant stage in the recognition of old newspapers. Text-line extraction from documents such as newspaper pages, which have very complex layouts, poses a significant challenge. Old newspaper documents printed in Gurumukhi script present several hurdles for segmentation, such as noise, degradation, ink bleed-through, multiple font styles and sizes, little space between neighboring text lines, and overlapping lines. Because of the low quality and complexity of these documents, automatic text-line segmentation remains an open research field. Very few studies in the literature address the segmentation of news articles in Gurumukhi script, and this is one of the first attempts to recognize Gurumukhi newspaper text. The goal of this paper is to present a new methodology for text-line extraction that integrates median calculation and strip height calculation techniques. The unsuitability of existing techniques for segmenting newspaper text lines is also discussed, with results, in the article. The efficiency of the proposed algorithm is demonstrated by experiments on two diverse self-made datasets: (a) single-column documents with a headline block and (b) multi-column documents with a headline block.
ISSN: 1433-7541, 1433-755X
DOI: 10.1007/s10044-021-01031-6