Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language

This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation.

Detailed Description

Saved in:
Bibliographic Details
Published in: arXiv.org 2023-12
Main Authors: Drchal, Jan, Ullrich, Herbert, Mlynář, Tomáš, Moravec, Václav
Format: Article
Language: eng
Subjects: Annotations; Automation; Computer Science - Computation and Language; Datasets; Languages; Machine translation; Pipelines; Training
Online Access: Full text
container_title arXiv.org
creator Drchal, Jan; Ullrich, Herbert; Mlynář, Tomáš; Moravec, Václav
description This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data. The objective is to assess the accuracy of textual claims using evidence from a ground-truth evidence corpus. The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation. Our primary focus is on the ease of deployment in various languages that remain unexplored in the field of automated fact-checking. Unlike most similar pipelines, which work with evidence sentences, our pipeline processes data on a paragraph level, simplifying the overall architecture and data requirements. Given the high cost of annotating language-specific fact-checking training data, our solution builds on the Question Answering for Claim Generation (QACG) method, which we adapt and use to generate the data for all models of the pipeline. Our strategy enables the introduction of new languages through machine translation of only two fixed datasets of moderate size. Subsequently, any number of training samples can be generated based on an evidence corpus in the target language. We provide open access to all data and fine-tuned models for the Czech, English, Polish, and Slovak pipelines, as well as to our codebase, which may be used to reproduce the results. We comprehensively evaluate the pipelines for all four languages, including human annotations and per-sample difficulty assessment using Pointwise V-information. The presented experiments are based on full Wikipedia snapshots to promote reproducibility. To facilitate implementation and user interaction, we develop the FactSearch application featuring the proposed pipeline, and report preliminary feedback on its performance.
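The two-module design in the description (paragraph-level evidence retrieval followed by claim veracity evaluation) can be illustrated with a minimal Python sketch using off-the-shelf multilingual checkpoints. The model names and the SUPPORTS/REFUTES/NOT ENOUGH INFO label mapping below are illustrative assumptions, not the authors' released Czech/English/Polish/Slovak models.

    # Module 1: dense retrieval over a paragraph-level evidence corpus.
    from sentence_transformers import SentenceTransformer, util
    from transformers import pipeline

    retriever = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    corpus = [
        "Prague is the capital and largest city of the Czech Republic.",
        "The Vltava is the longest river within the Czech Republic.",
    ]
    corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

    claim = "Prague is the capital of the Czech Republic."
    claim_emb = retriever.encode(claim, convert_to_tensor=True)

    # Keep the top-k paragraphs by cosine similarity as evidence.
    hits = util.semantic_search(claim_emb, corpus_emb, top_k=1)[0]
    evidence = corpus[hits[0]["corpus_id"]]

    # Module 2: veracity evaluation framed as NLI over (evidence, claim) pairs.
    nli = pipeline("text-classification", model="joeddav/xlm-roberta-large-xnli")
    result = nli({"text": evidence, "text_pair": claim})
    label = result[0]["label"] if isinstance(result, list) else result["label"]

    # Map NLI labels to fact-checking verdicts (an illustrative convention).
    verdict = {"entailment": "SUPPORTS",
               "contradiction": "REFUTES",
               "neutral": "NOT ENOUGH INFO"}[label.lower()]
    print(f"{claim!r} -> {verdict}, given evidence: {evidence!r}")

The per-sample difficulty measure named in the description, Pointwise V-information, follows Ethayarajh et al. (2022): with g and g' denoting the same model family finetuned with and without access to the input x, a sample (x, y) scores

    \mathrm{PVI}(x \to y) = -\log_2 g'[\varnothing](y) + \log_2 g[x](y)

so a higher PVI marks a sample whose input makes the gold label easier for the model family to predict, i.e. an easier sample.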
doi_str_mv 10.48550/arxiv.2312.10171
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2023-12
issn 2331-8422
language eng
recordid cdi_arxiv_primary_2312_10171
source arXiv.org; Free E-Journals
subjects Annotations
Automation
Computer Science - Computation and Language
Datasets
Languages
Machine translation
Pipelines
Training
title Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language