Comparison and Classification of Documents Based on Layout Similarity

This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called in...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Information retrieval (Boston) 2000-05, Vol.2 (2-3), p.227-243
Hauptverfasser: Hu, Jianying, Ramanujan Kashi, Wilfong, Gordon
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual similarity based document retrieval as well as fast algorithms for initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors by capturing structural characteristics of the image. These fixed-length vectors are then compared to each other through a Manhattan distance computation for fast page layout comparison. The paper describes experiments and results to rank-order a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval coding in a hidden Markov model based page layout classification system that is trainable and extendible. The methods described in the paper can be used in various document retrieval tasks including visual similarity based retrieval, categorization and information extraction. [PUBLICATION ABSTRACT]
ISSN:1386-4564
1573-7659
DOI:10.1023/a:1009910911387