Extracting information from handwritten content in census forms

Bibliographic details
Main authors: Huaigu Cao, Subramanian, K., Xujun Peng, Jinying Chen, Prasad, R., Natarajan, P.
Format: Conference proceeding
Language: English
Description
Summary: In this paper, we describe our approach for extracting salient information from images of US census forms. These forms present several challenges, including variations in individual form templates, skew, writing device, and writing style. We describe an innovative registration algorithm that is robust to scale variations and segments the input image into cells. Following registration, the cell borders are removed with a shape-based rule-line removal algorithm so that the handwritten content of each cell can be extracted. Finally, each cell image is recognized by a hidden Markov model (HMM) OCR system whose language models are biased toward the type of information in the cell, such as person name, place name, number, marital status, gender, or race.
ISSN: 1051-4651, 2831-7475
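
The last step in the summary, recognition with language models biased toward the cell type, can be illustrated with a small sketch. The summary does not say how the language models are integrated into the HMM recognizer, so the code below is not the authors' method: it only rescores a hypothetical list of recognizer hypotheses with a field-specific word prior. The names `FIELD_PRIORS`, `OOV_PROB`, `rescore`, and `lm_weight`, and all probabilities, are assumptions made up for illustration.

```python
import math

# Hypothetical per-field vocabularies with made-up probabilities.
# A real system would estimate these from data for each cell type.
FIELD_PRIORS = {
    "gender": {"male": 0.5, "female": 0.5},
    "marital_status": {
        "single": 0.4,
        "married": 0.4,
        "widowed": 0.15,
        "divorced": 0.05,
    },
}

# Floor probability for words outside the field vocabulary (assumed value).
OOV_PROB = 1e-6


def rescore(candidates, field, lm_weight=1.0):
    """Pick the hypothesis with the best combined log score.

    candidates: list of (text, recognizer_log_prob) pairs.
    field: key into FIELD_PRIORS naming the cell type.
    lm_weight: how strongly the field prior influences the choice.
    """
    priors = FIELD_PRIORS[field]

    def combined(item):
        text, rec_logp = item
        lm_logp = math.log(priors.get(text.lower(), OOV_PROB))
        return rec_logp + lm_weight * lm_logp

    return max(candidates, key=combined)[0]


# The recognizer alone slightly prefers "mule"; the gender prior corrects it.
hyps = [("mule", -1.0), ("male", -1.2), ("mate", -1.5)]
best = rescore(hyps, "gender")
print(best)
```

Setting `lm_weight=0.0` disables the prior and returns the recognizer's own top hypothesis, which makes the effect of the field-specific bias easy to compare.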