All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction
Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. Ho...
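Main authors: | Li, Yuhan; Wu, Jian; Yu, Zhiwei; Karlsson, Börje F; Shen, Wei; Okumura, Manabu; Lin, Chin-Yew |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computation and Language |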
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Li, Yuhan ; Wu, Jian ; Yu, Zhiwei ; Karlsson, Börje F ; Shen, Wei ; Okumura, Manabu ; Lin, Chin-Yew |
description | Extracting key information from scientific papers has the potential to help
researchers work more efficiently and accelerate the pace of scientific
progress. Over the last few years, research on Scientific Information
Extraction (SciIE) witnessed the release of several new systems and benchmarks.
However, existing paper-focused datasets mostly focus only on specific parts of
a manuscript (e.g., abstracts) and are single-modality (i.e., text- or
table-only), due to complex processing and expensive annotations. Moreover,
core information can be present in either text or tables or across both. To
close this gap in data availability and enable cross-modality IE, while
alleviating labeling costs, we propose a semi-supervised pipeline for
annotating entities in text, as well as entities and relations in tables, in an
iterative procedure. Based on this pipeline, we release novel resources for the
scientific community, including a high-quality benchmark, a large-scale corpus,
and a semi-supervised annotation pipeline. We further report the performance of
state-of-the-art IE models on the proposed benchmark dataset, as a baseline.
Lastly, we explore the potential capability of large language models such as
ChatGPT for the current task. Our new dataset, results, and analysis validate
the effectiveness and efficiency of our semi-supervised pipeline, and we
discuss its remaining limitations. |
doi_str_mv | 10.48550/arxiv.2311.08189 |
format | Article |
fullrecord | type: article; source: arXiv.org; sourcetype: Open Access Repository; creationdate: 2023-11-14; rights: http://creativecommons.org/licenses/by/4.0; oa: free_for_read; linktorsrc: https://arxiv.org/abs/2311.08189; backlink: https://doi.org/10.48550/arXiv.2311.08189; collection: arXiv Computer Science, arXiv.org |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2311.08189 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2311_08189 |
source | arXiv.org |
subjects | Computer Science - Computation and Language |
title | All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T13%3A45%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=All%20Data%20on%20the%20Table:%20Novel%20Dataset%20and%20Benchmark%20for%20Cross-Modality%20Scientific%20Information%20Extraction&rft.au=Li,%20Yuhan&rft.date=2023-11-14&rft_id=info:doi/10.48550/arxiv.2311.08189&rft_dat=%3Carxiv_GOX%3E2311_08189%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |