Sample-based collection and adjustment of rules for metadata extraction in business documents

Bibliographic Details
Published in: Electronics and Communications in Japan, 2012-06, Vol. 95 (6), p. 1-11
Main authors: Matsumoto, Toshiko; Oba, Mitsuharu; Onoyama, Takashi; Akiyoshi, Masanori
Format: Article
Language: English
Online access: Full text
Description
Summary: Toward facile introduction of metadata‐based document management systems, we propose an algorithm which uses sample documents and their manually specified metadata as training data, and generates metadata‐extraction rules. Our algorithm enumerates candidates of keywords and layout characteristics specific to the metadata on the basis of metadata occurrence in the training data. It then examines whether each candidate is specific to only one kind of metadata. In an experiment on Japanese business documents and weekly reports, automatically generated rules achieved metadata extraction as accurate as manually adjusted ones. © 2012 Wiley Periodicals, Inc. Electron Comm Jpn, 95(6): 1–11, 2012; Published online in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/ecj.11373
ISSN: 1942-9533, 1942-9541
DOI: 10.1002/ecj.11373
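
The summary describes two steps: enumerating candidate keywords from where metadata occurs in the training samples, then keeping only candidates that are specific to one kind of metadata. The following is a minimal illustrative sketch of that idea, not the authors' algorithm. It assumes documents are lists of text lines, treats the text preceding a metadata value on its line as the candidate keyword, and ignores layout characteristics entirely. All function names and sample data are hypothetical.

```python
from collections import defaultdict


def enumerate_candidates(samples):
    """Map each candidate keyword to the set of metadata fields it preceded."""
    candidates = defaultdict(set)
    for lines, metadata in samples:
        for field, value in metadata.items():
            for line in lines:
                pos = line.find(value)
                if pos > 0:  # value found with some text in front of it
                    keyword = line[:pos].strip(" :")
                    if keyword:
                        candidates[keyword].add(field)
    return candidates


def select_specific(candidates):
    """Keep only keywords that were observed with exactly one field."""
    return {
        keyword: next(iter(fields))
        for keyword, fields in candidates.items()
        if len(fields) == 1
    }


def extract(rules, lines):
    """Apply keyword rules to a new document; first match per field wins."""
    result = {}
    for line in lines:
        for keyword, field in rules.items():
            if line.startswith(keyword):
                result.setdefault(field, line[len(keyword):].strip(" :"))
    return result


# Hypothetical training samples: (document lines, manually specified metadata).
samples = [
    (["Title: Weekly report", "Author: Sato", "Date: 2012-06-01"],
     {"title": "Weekly report", "author": "Sato", "date": "2012-06-01"}),
    (["Title: Budget plan", "Name: Budget plan", "Name: Tanaka"],
     {"title": "Budget plan", "author": "Tanaka"}),
]

rules = select_specific(enumerate_candidates(samples))
extracted = extract(rules, ["Title: Q3 forecast", "Author: Suzuki", "Name: unused"])
```

In this toy data the keyword "Name" precedes both a title and an author, so the specificity check discards it, while "Title", "Author", and "Date" survive as rules.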