Layout and Content Extraction for PDF Documents

Bibliographic Details
Main authors:	Chao, Hui; Fan, Jian
Format:	Book chapter
Language:	English
Subjects:	Applied sciences; Computer science, control theory, systems; Data processing. List processing. Character string processing; Document Image; Exact sciences and technology; Memory organisation. Data processing; Portable Document Format; Software; Text Block; Text Line; Text Segment

Description
Summary:	Portable Document Format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and lack basic high-level information about the document's logical structure, which makes reusing or modifying them difficult. We developed techniques that identify logical components on a PDF document page. The outlines, style attributes, and contents of the logical components are extracted and expressed in an XML format. These techniques could facilitate the reuse and modification of the layout and content of a PDF document page.
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/978-3-540-28640-0_20