Summarization of Imaged Documents without OCR


Bibliographic details
Published in: Computer Vision and Image Understanding, 1998-06, Vol. 70 (3), p. 307-320
Main authors: Chen, Francine R; Bloomberg, Dan S
Format: Article
Language: English
Description
Summary: A system is presented for creating a summary indicating the contents of an imaged document. The summary is composed from selected regions extracted from the imaged document. The regions may include sentences, key phrases, headings, and figures. The extracts are identified without the use of optical character recognition. The imaged document is first processed to identify the word-bounding boxes, the reading order of words, and the location of sentence and paragraph boundaries in the text. The word-bounding boxes are grouped into equivalence classes to mimic the terms in a text document. Equivalence classes representing content words are identified, and key phrases are identified from the set of content words. Summary sentences are selected using a statistically based classifier applied to a set of discrete sentence features. Evaluation of sentence selection against a set of abstracts created by a professional abstracting company is given.
ISSN: 1077-3142, 1090-235X
DOI: 10.1006/cviu.1998.0688
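
The summary describes grouping word-bounding boxes into equivalence classes and then picking out the classes that stand for content words. The Python sketch below is only a toy illustration of that idea and is not the authors' method: the pixel-mismatch test, the width and frequency heuristic for content words, the thresholds, and all function names are assumptions made for this example. Word images are represented as tiny binary bitmaps (lists of strings of "0" and "1").

```python
from collections import Counter


def mismatch(a, b):
    """Fraction of differing pixels between two equal-sized binary bitmaps."""
    total = len(a) * len(a[0])
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def group_into_classes(bitmaps, max_mismatch=0.15):
    """Greedily assign each word bitmap to an equivalence class.

    A bitmap joins the first class whose exemplar has the same size and
    differs in at most `max_mismatch` of its pixels (assumed criterion);
    otherwise it starts a new class. Returns one class id per bitmap,
    in reading order.
    """
    exemplars = []
    labels = []
    for bm in bitmaps:
        shape = (len(bm), len(bm[0]))
        for cid, ex in enumerate(exemplars):
            same_shape = (len(ex), len(ex[0])) == shape
            if same_shape and mismatch(bm, ex) <= max_mismatch:
                labels.append(cid)
                break
        else:
            # No existing class matched: this bitmap becomes a new exemplar.
            exemplars.append(bm)
            labels.append(len(exemplars) - 1)
    return labels


def content_classes(labels, widths, min_width=4, min_count=2):
    """Pick classes that recur and are wide enough to be content words.

    Assumed heuristic: short word images tend to be function words, so
    narrow classes are dropped even when they are frequent.
    """
    counts = Counter(labels)
    width_of = {}
    for cid, w in zip(labels, widths):
        width_of.setdefault(cid, w)  # width of the class exemplar
    return {
        cid
        for cid, n in counts.items()
        if n >= min_count and width_of[cid] >= min_width
    }


# Toy "page": six word images in reading order.
word_a = ["1111", "1001"]
word_a_noisy = ["1111", "1011"]  # one pixel differs from word_a
word_b = ["11", "11"]
word_c = ["101010", "010101"]
page = [word_a, word_b, word_a_noisy, word_c, word_b, word_a]

labels = group_into_classes(page)
widths = [len(bm[0]) for bm in page]
keep = content_classes(labels, widths)
```

In this toy run the noisy copy of `word_a` falls into the same class as the clean one, and only that class is both frequent and wide enough to be kept. Sentence selection with a statistical classifier, as mentioned in the summary, is not attempted here.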