All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Extracting key information from scientific papers has the potential to help
researchers work more efficiently and accelerate the pace of scientific
progress. Over the last few years, research on Scientific Information
Extraction (SciIE) has witnessed the release of several new systems and benchmarks.
However, existing paper-focused datasets mostly cover only specific parts of
a manuscript (e.g., abstracts) and are single-modality (i.e., text- or
table-only), due to complex processing and expensive annotations. Moreover,
core information can be present in either text or tables or across both. To
close this gap in data availability and enable cross-modality IE, while
alleviating labeling costs, we propose a semi-supervised pipeline for
annotating entities in text, as well as entities and relations in tables, in an
iterative procedure. Based on this pipeline, we release novel resources for the
scientific community, including a high-quality benchmark, a large-scale corpus,
and a semi-supervised annotation pipeline. We further report the performance of
state-of-the-art IE models on the proposed benchmark dataset, as a baseline.
Lastly, we explore the potential capability of large language models such as
ChatGPT for the current task. Our new dataset, results, and analysis validate
the effectiveness and efficiency of our semi-supervised pipeline, and we
discuss its remaining limitations. |
---|---|
DOI: | 10.48550/arxiv.2311.08189 |