Sample-based collection and adjustment of rules for metadata extraction in business documents
Published in: | Electronics and Communications in Japan, 2012-06, Vol. 95 (6), pp. 1-11 |
---|---|
Format: | Article |
Language: | English |
Online access: | Full text |
Abstract: | Toward facile introduction of metadata‐based document management systems, we propose an algorithm that uses sample documents and their manually specified metadata as training data to generate metadata‐extraction rules. Our algorithm enumerates candidate keywords and layout characteristics specific to the metadata on the basis of metadata occurrence in the training data. It then examines whether each candidate is specific to only one kind of metadata. In an experiment on Japanese business documents and weekly reports, automatically generated rules achieved metadata extraction as accurate as manually adjusted ones. © 2012 Wiley Periodicals, Inc. Electron Comm Jpn, 95(6): 1–11, 2012; Published online in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/ecj.11373 |
ISSN: | 1942-9533, 1942-9541 |
DOI: | 10.1002/ecj.11373 |