Method for document comparison and classification using document image layout

Bibliographic details
Main authors:	Hu, Jianying; Kashi, Ramanujan S.; Wilfong, Gordon Thomas
Format:	Patent
Language:	English

Description
Abstract:	The present invention relates generally to document processing. Specifically, a method is taught that uses document layout data to compare and/or classify documents by type. Comparison and classification by layout begin by segmenting a document page into blocks of text and white space. A grid of rows and columns, forming bins, is then laid over the page so that it intersects the blocks. Layout information is captured by a unique fixed-length interval vector that represents each row of the segmented document. By computing the Manhattan distance between the interval vectors of all rows of two document pages and applying a warping function to determine the row-to-row correspondence, two documents may be compared by their layout. Furthermore, interval vectors may be grouped into N clusters, with each interval vector replaced by the center of its cluster, defined as the median of the interval vectors in that cluster. Using Hidden Markov Models, documents can be compared to document type models, which comprise rows represented by cluster centers, and identified as belonging to one or more document types. In addition, documents stored in a database may be retrieved, deleted, or otherwise managed by type using their corresponding vector sets, without requiring expensive OCR of the document. Furthermore, once a document is classified, the blocks that contain particular information can be located readily. Where only that information is desired, it is not necessary to perform OCR on the entire document. Rather, OCR may be limited to those blocks where the particular information is expected based on the document type.
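
The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not say how an interval vector is encoded, so each row is simply assumed to be a fixed-length list of numbers. The "warping function" is assumed to be standard dynamic time warping, and the cluster center is assumed to be a component-wise median. The function names `manhattan`, `layout_distance` and `cluster_center` are hypothetical.

```python
from statistics import median
from typing import List, Sequence

Vector = Sequence[float]


def manhattan(a: Vector, b: Vector) -> float:
    """Manhattan (L1) distance between two fixed-length interval vectors."""
    if len(a) != len(b):
        raise ValueError("interval vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def layout_distance(page_a: List[Vector], page_b: List[Vector]) -> float:
    """Total cost of the cheapest row-to-row alignment of two pages.

    Assumption: the warping is plain dynamic time warping, where each row
    may match one or more consecutive rows of the other page.
    """
    n, m = len(page_a), len(page_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i rows of page_a
    # with the first j rows of page_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = manhattan(page_a[i - 1], page_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # row of page_a repeats a match
                cost[i][j - 1],      # row of page_b repeats a match
                cost[i - 1][j - 1],  # both pages advance one row
            )
    return cost[n][m]


def cluster_center(vectors: List[Vector]) -> List[float]:
    """Component-wise median of the interval vectors in one cluster
    (an assumed reading of "median of the interval vectors")."""
    return [median(column) for column in zip(*vectors)]
```

Under these assumptions, two pages whose rows differ only by a repeated row align at zero cost, so pages with the same layout but different row counts still compare as similar. Such a distance would let pages be compared by layout alone, without running OCR.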