Knowledge Extraction from LLMs for Scalable Historical Data Annotation
This paper introduces a novel approach to extract knowledge from large language models and generate structured historical datasets. We investigate the feasibility and limitations of this technique by comparing the generated data against two human-annotated historical datasets spanning from 10,000 BC...
Published in: | Electronics (Basel) 2024-12, Vol.13 (24), p.4990 |
---|---|
Main authors: | Celli, Fabio ; Mingazov, Dmitry |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | 24 |
container_start_page | 4990 |
container_title | Electronics (Basel) |
container_volume | 13 |
creator | Celli, Fabio ; Mingazov, Dmitry |
description | This paper introduces a novel approach to extract knowledge from large language models and generate structured historical datasets. We investigate the feasibility and limitations of this technique by comparing the generated data against two human-annotated historical datasets spanning from 10,000 BC to 2000 CE. Our findings demonstrate that generative AI can successfully produce historical annotations for a wide range of variables, including political, economic, and social factors. However, the model’s performance varies across different regions, influenced by factors such as data granularity, historical complexity, and model limitations. We highlight the importance of high-quality instructions and effective prompt engineering to mitigate issues like hallucinations and improve the accuracy of generated annotations. The successful application of this technique can significantly accelerate the development of reliable structured historical datasets, with a potentially high impact on comparative and computational history. |
doi_str_mv | 10.3390/electronics13244990 |
format | Article |
publisher | Basel: MDPI AG |
creationdate | 2024-12-01 |
eissn | 2079-9292 |
orcid | https://orcid.org/0000-0002-7309-5886 |
rights | COPYRIGHT 2024 MDPI AG. 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
toplevel | peer_reviewed; online_resources |
oa | free_for_read |
fulltext | fulltext |
identifier | ISSN: 2079-9292 |
ispartof | Electronics (Basel), 2024-12, Vol.13 (24), p.4990 |
issn | 2079-9292 2079-9292 |
language | eng |
recordid | cdi_proquest_journals_3149599680 |
source | MDPI - Multidisciplinary Digital Publishing Institute; EZB-FREE-00999 freely available EZB journals |
subjects | Annotations; Crowdsourcing; Data integrity; Datasets; Generative artificial intelligence; Knowledge representation; Large language models; Machine learning; Prompt engineering; Sparsity; Statistical analysis; Subjectivity |
title | Knowledge Extraction from LLMs for Scalable Historical Data Annotation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T23%3A28%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_proqu&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Knowledge%20Extraction%20from%20LLMs%20for%20Scalable%20Historical%20Data%20Annotation&rft.jtitle=Electronics%20(Basel)&rft.au=Celli,%20Fabio&rft.date=2024-12-01&rft.volume=13&rft.issue=24&rft.spage=4990&rft.pages=4990-&rft.issn=2079-9292&rft.eissn=2079-9292&rft_id=info:doi/10.3390/electronics13244990&rft_dat=%3Cgale_proqu%3EA821763357%3C/gale_proqu%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3149599680&rft_id=info:pmid/&rft_galeid=A821763357&rfr_iscdi=true |