CED: Catalog Extraction from Documents
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | Sentence-by-sentence information extraction from long documents is an
exhausting and error-prone task. As indicators of a document's skeleton,
catalogs naturally chunk documents into segments and provide informative
cascade semantics, which can help to reduce the search space. Despite their
usefulness, catalogs are hard to extract without the assistance of external
knowledge. For documents that adhere to a specific template, regular
expressions are practical to extract catalogs. However, handcrafted heuristics
are not applicable when processing documents from different sources with
diverse formats. To address this problem, we build a large manually annotated
corpus, which is the first dataset for the Catalog Extraction from Documents
(CED) task. Based on this corpus, we propose a transition-based framework for
parsing documents into catalog trees. The experimental results demonstrate that
our proposed method outperforms baseline systems and shows a good ability to
transfer. We believe the CED task could fill the gap between raw text segments
and information extraction tasks on extremely long documents. Data and code are
available at https://github.com/Spico197/CatalogExtraction |
---|---|
DOI: | 10.48550/arxiv.2304.14662 |
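
The summary notes that regular expressions are practical for extracting catalogs from documents that follow a fixed template. The following is a minimal sketch of that heuristic baseline, not the transition-based framework from the paper. It assumes a hypothetical template in which headings are numbered like "1", "1.2" or "1.2.3"; the names `HEADING`, `Node`, `build_catalog` and `outline` are illustrative only and do not come from the authors' repository.

```python
import re
from dataclasses import dataclass, field

# Assumed template (hypothetical): a heading is a dotted number such as
# "1", "1.2" or "1.2.3", followed by whitespace and a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Node:
    title: str
    depth: int
    children: list = field(default_factory=list)
    text: list = field(default_factory=list)


def build_catalog(lines):
    """Build a catalog tree from lines that follow the assumed template."""
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently opened heading
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match is None:
            # Body text is attached to the most recently opened heading.
            stack[-1].text.append(line)
            continue
        depth = match.group(1).count(".") + 1
        # Close headings at the same or a deeper level before attaching.
        while stack[-1].depth >= depth:
            stack.pop()
        node = Node(match.group(2), depth)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node, indent=0):
    """Return the headings below `node` as indented strings."""
    rows = []
    for child in node.children:
        rows.append("  " * indent + child.title)
        rows.extend(outline(child, indent + 1))
    return rows


if __name__ == "__main__":
    doc = ["1 Overview", "Some text.", "1.1 Scope", "2 Results"]
    print("\n".join(outline(build_catalog(doc))))
```

A body line that happens to start with a number (for example "2023 was a strong year") would be misread as a heading by this pattern, which illustrates why such handcrafted rules do not carry over to documents from other sources.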