Word embeddings for retrieving tabular data from research publications

Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata, without considering the data contained in the publication in the form o...

Full description

Bibliographic details
Published in:	Machine learning 2024-04, Vol.113 (4), p.2227-2248
Main authors:	Berenguer, Alberto; Mazón, Jose-Norberto; Tomás, David
Format:	Article
Language:	eng
Subjects:	Artificial Intelligence; Computer Science; Control; Datasets; Machine Learning; Mechatronics; Natural Language Processing (NLP); Robotics; Scientists; Search engines; Simulation and Modeling; Special Issue on Discovery Science 2022; Tables (data); Words (language)
Online access:	Full text

container_end_page	2248
container_issue	4
container_start_page	2227
container_title	Machine learning
container_volume	113
creator	Berenguer, Alberto ; Mazón, Jose-Norberto ; Tomás, David
description	Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata, without considering the data contained in the publication in the form of tables. Moreover, scientists require more elaborate inputs and functionalities to retrieve different parts of an article, such as data presented in tables, based on their search purposes. Therefore, this paper proposes a novel approach to retrieve relevant tabular datasets from publications. The input of our system is a research problem stated as an abstract from a scientific paper, and the output is a set of relevant tables from publications that are related to the research problem. This approach aims to provide a better solution for scientists to find useful datasets that support them in addressing their research problems. To validate this approach, experiments were conducted using word embeddings from different language models to calculate the semantic similarity between abstracts and tables. The results showed that contextual models significantly outperformed non-contextual models, especially when pre-trained with scientific data. Furthermore, context was found to be crucial for improving the results.
doi_str_mv	10.1007/s10994-023-06472-0
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_3013903346</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3013903346</sourcerecordid><originalsourceid>FETCH-LOGICAL-c363t-ea04877b054c75d23b253e92e4b7a2fef67718bff1fd51491df7153596eb212d3</originalsourceid><addsrcrecordid>eNp9kMtKxDAUhoMoWEdfwFXAdfTk3i5lcFQYcKO4DEmTjB1m2jFpBd_eaAV3rg4_57_Ah9AlhWsKoG8yhaYRBBgnoIRmBI5QRaUuUip5jCqoa0kUZfIUneW8BQCmalWh1euQPA57F7zv-k3GcUg4hTF14aNoPFo37WzC3o4WxzTsyzMHm9o3fJjcrmvt2A19Pkcn0e5yuPi9C_SyuntePpD10_3j8nZNWq74SIIFUWvtQIpWS8-4Y5KHhgXhtGUxRKU1rV2MNHpJRUN91FRy2ajgGGWeL9DV3HtIw_sU8mi2w5T6Mmk4UN4A50IVF5tdbRpyTiGaQ-r2Nn0aCuabl5l5mcLL_PAyUEJ8DuVi7jch_VX_k_oCRQdthg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3013903346</pqid></control><display><type>article</type><title>Word embeddings for retrieving tabular data from research publications</title><source>SpringerLink Journals - AutoHoldings</source><creator>Berenguer, Alberto ; Mazón, Jose-Norberto ; Tomás, David</creator><creatorcontrib>Berenguer, Alberto ; Mazón, Jose-Norberto ; Tomás, David</creatorcontrib><description>Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata, do not considering the data contained in the publication in the form of tables. Moreover, scientists require more elaborate inputs and functionalities to retrieve different parts of an article, such as data presented in tables, based on their search purposes. Therefore, this paper proposes a novel approach to retrieve relevant tabular datasets from publications. 
The input of our system is a research problem stated as an abstract from a scientific paper, and the output is a set of relevant tables from publications that are related to the research problem. This approach aims to provide a better solution for scientists to find useful datasets that support them in addressing their research problems. To validate this approach, experiments were conducted using word embedding from different language models to calculate the semantic similarity between abstracts and tables. The results showed that contextual models significantly outperformed non-contextual models, especially when pre-trained with scientific data. Furthermore, the importance of context was found to be crucial for improving the results.</description><identifier>ISSN: 0885-6125</identifier><identifier>EISSN: 1573-0565</identifier><identifier>DOI: 10.1007/s10994-023-06472-0</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Artificial Intelligence ; Computer Science ; Control ; Datasets ; Machine Learning ; Mechatronics ; Natural Language Processing (NLP) ; Robotics ; Scientists ; Search engines ; Simulation and Modeling ; Special Issue on Discovery Science 2022 ; Tables (data) ; Words (language)</subject><ispartof>Machine learning, 2024-04, Vol.113 (4), p.2227-2248</ispartof><rights>The Author(s) 2023</rights><rights>The Author(s) 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c363t-ea04877b054c75d23b253e92e4b7a2fef67718bff1fd51491df7153596eb212d3</citedby><cites>FETCH-LOGICAL-c363t-ea04877b054c75d23b253e92e4b7a2fef67718bff1fd51491df7153596eb212d3</cites><orcidid>0000-0002-2867-8329</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10994-023-06472-0$$EPDF$$P50$$Gspringer$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10994-023-06472-0$$EHTML$$P50$$Gspringer$$Hfree_for_read</linktohtml><link.rule.ids>314,776,780,27903,27904,41467,42536,51297</link.rule.ids></links><search><creatorcontrib>Berenguer, Alberto</creatorcontrib><creatorcontrib>Mazón, Jose-Norberto</creatorcontrib><creatorcontrib>Tomás, David</creatorcontrib><title>Word embeddings for retrieving tabular data from research publications</title><title>Machine learning</title><addtitle>Mach Learn</addtitle><description>Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata, do not considering the data contained in the publication in the form of tables. Moreover, scientists require more elaborate inputs and functionalities to retrieve different parts of an article, such as data presented in tables, based on their search purposes. Therefore, this paper proposes a novel approach to retrieve relevant tabular datasets from publications. 
The input of our system is a research problem stated as an abstract from a scientific paper, and the output is a set of relevant tables from publications that are related to the research problem. This approach aims to provide a better solution for scientists to find useful datasets that support them in addressing their research problems. To validate this approach, experiments were conducted using word embedding from different language models to calculate the semantic similarity between abstracts and tables. The results showed that contextual models significantly outperformed non-contextual models, especially when pre-trained with scientific data. Furthermore, the importance of context was found to be crucial for improving the results.</description><subject>Artificial Intelligence</subject><subject>Computer Science</subject><subject>Control</subject><subject>Datasets</subject><subject>Machine Learning</subject><subject>Mechatronics</subject><subject>Natural Language Processing (NLP)</subject><subject>Robotics</subject><subject>Scientists</subject><subject>Search engines</subject><subject>Simulation and Modeling</subject><subject>Special Issue on Discovery Science 2022</subject><subject>Tables (data)</subject><subject>Words (language)</subject><issn>0885-6125</issn><issn>1573-0565</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>C6C</sourceid><recordid>eNp9kMtKxDAUhoMoWEdfwFXAdfTk3i5lcFQYcKO4DEmTjB1m2jFpBd_eaAV3rg4_57_Ah9AlhWsKoG8yhaYRBBgnoIRmBI5QRaUuUip5jCqoa0kUZfIUneW8BQCmalWh1euQPA57F7zv-k3GcUg4hTF14aNoPFo37WzC3o4WxzTsyzMHm9o3fJjcrmvt2A19Pkcn0e5yuPi9C_SyuntePpD10_3j8nZNWq74SIIFUWvtQIpWS8-4Y5KHhgXhtGUxRKU1rV2MNHpJRUN91FRy2ajgGGWeL9DV3HtIw_sU8mi2w5T6Mmk4UN4A50IVF5tdbRpyTiGaQ-r2Nn0aCuabl5l5mcLL_PAyUEJ8DuVi7jch_VX_k_oCRQdthg</recordid><startdate>20240401</startdate><enddate>20240401</enddate><creator>Berenguer, Alberto</creator><creator>Mazón, Jose-Norberto</creator><creator>Tomás, 
David</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>C6C</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0002-2867-8329</orcidid></search><sort><creationdate>20240401</creationdate><title>Word embeddings for retrieving tabular data from research publications</title><author>Berenguer, Alberto ; Mazón, Jose-Norberto ; Tomás, David</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c363t-ea04877b054c75d23b253e92e4b7a2fef67718bff1fd51491df7153596eb212d3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Artificial Intelligence</topic><topic>Computer Science</topic><topic>Control</topic><topic>Datasets</topic><topic>Machine Learning</topic><topic>Mechatronics</topic><topic>Natural Language Processing (NLP)</topic><topic>Robotics</topic><topic>Scientists</topic><topic>Search engines</topic><topic>Simulation and Modeling</topic><topic>Special Issue on Discovery Science 2022</topic><topic>Tables (data)</topic><topic>Words (language)</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Berenguer, Alberto</creatorcontrib><creatorcontrib>Mazón, Jose-Norberto</creatorcontrib><creatorcontrib>Tomás, David</creatorcontrib><collection>Springer Nature OA Free Journals</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Machine 
learning</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Berenguer, Alberto</au><au>Mazón, Jose-Norberto</au><au>Tomás, David</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Word embeddings for retrieving tabular data from research publications</atitle><jtitle>Machine learning</jtitle><stitle>Mach Learn</stitle><date>2024-04-01</date><risdate>2024</risdate><volume>113</volume><issue>4</issue><spage>2227</spage><epage>2248</epage><pages>2227-2248</pages><issn>0885-6125</issn><eissn>1573-0565</eissn><abstract>Scientists face challenges when finding datasets related to their research problems due to the limitations of current dataset search engines. Existing tools for searching research datasets rely on publication content or metadata, do not considering the data contained in the publication in the form of tables. Moreover, scientists require more elaborate inputs and functionalities to retrieve different parts of an article, such as data presented in tables, based on their search purposes. Therefore, this paper proposes a novel approach to retrieve relevant tabular datasets from publications. The input of our system is a research problem stated as an abstract from a scientific paper, and the output is a set of relevant tables from publications that are related to the research problem. This approach aims to provide a better solution for scientists to find useful datasets that support them in addressing their research problems. To validate this approach, experiments were conducted using word embedding from different language models to calculate the semantic similarity between abstracts and tables. The results showed that contextual models significantly outperformed non-contextual models, especially when pre-trained with scientific data. 
Furthermore, the importance of context was found to be crucial for improving the results.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s10994-023-06472-0</doi><tpages>22</tpages><orcidid>https://orcid.org/0000-0002-2867-8329</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0885-6125
ispartof	Machine learning, 2024-04, Vol.113 (4), p.2227-2248
issn	0885-6125 1573-0565
language	eng
recordid	cdi_proquest_journals_3013903346
source	SpringerLink Journals - AutoHoldings
subjects	Artificial Intelligence ; Computer Science ; Control ; Datasets ; Machine Learning ; Mechatronics ; Natural Language Processing (NLP) ; Robotics ; Scientists ; Search engines ; Simulation and Modeling ; Special Issue on Discovery Science 2022 ; Tables (data) ; Words (language)
title	Word embeddings for retrieving tabular data from research publications
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-28T05%3A54%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Word%20embeddings%20for%20retrieving%20tabular%20data%20from%20research%20publications&rft.jtitle=Machine%20learning&rft.au=Berenguer,%20Alberto&rft.date=2024-04-01&rft.volume=113&rft.issue=4&rft.spage=2227&rft.epage=2248&rft.pages=2227-2248&rft.issn=0885-6125&rft.eissn=1573-0565&rft_id=info:doi/10.1007/s10994-023-06472-0&rft_dat=%3Cproquest_cross%3E3013903346%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3013903346&rft_id=info:pmid/&rfr_iscdi=true
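
The description field in this record says that tables are retrieved by computing the semantic similarity between an abstract and candidate tables using word embeddings. The record does not say how the embeddings are pooled or which similarity measure is used, so the following is only a minimal sketch under stated assumptions: mean pooling of word vectors, cosine similarity, and a tiny hand-made vector table (TOY_VECTORS) standing in for a real language model. The names embed, cosine, and rank_tables are hypothetical and do not come from the paper.

```python
import math

# Toy 3-dimensional word vectors. These values are invented for illustration
# and are NOT taken from any trained language model.
TOY_VECTORS = {
    "protein": [0.9, 0.1, 0.0],
    "binding": [0.8, 0.2, 0.1],
    "affinity": [0.7, 0.3, 0.0],
    "traffic": [0.0, 0.9, 0.2],
    "vehicle": [0.1, 0.8, 0.3],
    "speed": [0.2, 0.7, 0.4],
}


def embed(text):
    """Mean-pool the vectors of known words (assumed pooling strategy)."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        # No known word: fall back to the zero vector.
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tables(abstract, tables):
    """Return table names ordered from most to least similar to the abstract."""
    query = embed(abstract)
    scored = [(cosine(query, embed(text)), name) for name, text in tables.items()]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    # Each table is represented here by a flat string of its header words.
    tables = {
        "table_a": "protein binding affinity",
        "table_b": "vehicle speed traffic",
    }
    print(rank_tables("we study protein binding", tables))
```

In a realistic setting the toy lookup would be replaced by vectors from a pre-trained model; the abstract reports that contextual models pre-trained on scientific data performed best, but the ranking step would keep the same shape.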