All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction

Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. Ho...

Bibliographic details
Main authors:	Li, Yuhan; Wu, Jian; Yu, Zhiwei; Karlsson, Börje F; Shen, Wei; Okumura, Manabu; Lin, Chin-Yew
Format:	Article
Language:	eng
Subjects:	Computer Science - Computation and Language
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Li, Yuhan ; Wu, Jian ; Yu, Zhiwei ; Karlsson, Börje F ; Shen, Wei ; Okumura, Manabu ; Lin, Chin-Yew
description	Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly focus only on specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.
doi_str_mv	10.48550/arxiv.2311.08189
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2311_08189</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2311_08189</sourcerecordid><originalsourceid>FETCH-LOGICAL-a679-f4d76796c335d2e9cee8d929e545495d01c1cc3bd0b9cd88b64769c9b6e803713</originalsourceid><addsrcrecordid>eNotj8FOwzAQRH3hgAofwIn9gQQ7jhObWwkFKhU4kHvkrDfUIk2QY1Xt35O2nGakkZ7mMXYneJprpfiDDQe_TzMpRMq10OaafS_7Hp5ttDAOELcEtW17eoSPcU-XYaIIdnDwRANudzb8QDcGqMI4Tcn76Gzv4xG-0NMQfecR1sO872z0M3B1iMHiqd6wq872E93-54LVL6u6eks2n6_rarlJbFGapMtdOWeBUiqXkUEi7UxmSOUqN8pxgQJRto63Bp3WbZGXhUHTFqS5LIVcsPsL9mza_AY_Pz42J-PmbCz_ALWuUVQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction</title><source>arXiv.org</source><creator>Li, Yuhan ; Wu, Jian ; Yu, Zhiwei ; Karlsson, Börje F ; Shen, Wei ; Okumura, Manabu ; Lin, Chin-Yew</creator><creatorcontrib>Li, Yuhan ; Wu, Jian ; Yu, Zhiwei ; Karlsson, Börje F ; Shen, Wei ; Okumura, Manabu ; Lin, Chin-Yew</creatorcontrib><description>Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly focus only on specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. 
To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.</description><identifier>DOI: 10.48550/arxiv.2311.08189</identifier><language>eng</language><subject>Computer Science - Computation and Language</subject><creationdate>2023-11</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2311.08189$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2311.08189$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Li, Yuhan</creatorcontrib><creatorcontrib>Wu, Jian</creatorcontrib><creatorcontrib>Yu, Zhiwei</creatorcontrib><creatorcontrib>Karlsson, Börje F</creatorcontrib><creatorcontrib>Shen, Wei</creatorcontrib><creatorcontrib>Okumura, Manabu</creatorcontrib><creatorcontrib>Lin, Chin-Yew</creatorcontrib><title>All Data on the Table: Novel Dataset and Benchmark for 
Cross-Modality Scientific Information Extraction</title><description>Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly focus only on specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. 
Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.</description><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj8FOwzAQRH3hgAofwIn9gQQ7jhObWwkFKhU4kHvkrDfUIk2QY1Xt35O2nGakkZ7mMXYneJprpfiDDQe_TzMpRMq10OaafS_7Hp5ttDAOELcEtW17eoSPcU-XYaIIdnDwRANudzb8QDcGqMI4Tcn76Gzv4xG-0NMQfecR1sO872z0M3B1iMHiqd6wq872E93-54LVL6u6eks2n6_rarlJbFGapMtdOWeBUiqXkUEi7UxmSOUqN8pxgQJRto63Bp3WbZGXhUHTFqS5LIVcsPsL9mza_AY_Pz42J-PmbCz_ALWuUVQ</recordid><startdate>20231114</startdate><enddate>20231114</enddate><creator>Li, Yuhan</creator><creator>Wu, Jian</creator><creator>Yu, Zhiwei</creator><creator>Karlsson, Börje F</creator><creator>Shen, Wei</creator><creator>Okumura, Manabu</creator><creator>Lin, Chin-Yew</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20231114</creationdate><title>All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction</title><author>Li, Yuhan ; Wu, Jian ; Yu, Zhiwei ; Karlsson, Börje F ; Shen, Wei ; Okumura, Manabu ; Lin, Chin-Yew</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a679-f4d76796c335d2e9cee8d929e545495d01c1cc3bd0b9cd88b64769c9b6e803713</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Computation and Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Li, Yuhan</creatorcontrib><creatorcontrib>Wu, Jian</creatorcontrib><creatorcontrib>Yu, Zhiwei</creatorcontrib><creatorcontrib>Karlsson, Börje F</creatorcontrib><creatorcontrib>Shen, Wei</creatorcontrib><creatorcontrib>Okumura, Manabu</creatorcontrib><creatorcontrib>Lin, Chin-Yew</creatorcontrib><collection>arXiv Computer 
Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Li, Yuhan</au><au>Wu, Jian</au><au>Yu, Zhiwei</au><au>Karlsson, Börje F</au><au>Shen, Wei</au><au>Okumura, Manabu</au><au>Lin, Chin-Yew</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction</atitle><date>2023-11-14</date><risdate>2023</risdate><abstract>Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly focus only on specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. 
Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.</abstract><doi>10.48550/arxiv.2311.08189</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2311.08189
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2311_08189
source	arXiv.org
subjects	Computer Science - Computation and Language
title	All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T13%3A45%3A46IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=All%20Data%20on%20the%20Table:%20Novel%20Dataset%20and%20Benchmark%20for%20Cross-Modality%20Scientific%20Information%20Extraction&rft.au=Li,%20Yuhan&rft.date=2023-11-14&rft_id=info:doi/10.48550/arxiv.2311.08189&rft_dat=%3Carxiv_GOX%3E2311_08189%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true