ViFactCheck: A New Benchmark Dataset and Methods for Multi-domain News Fact-Checking in Vietnamese
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Abstract: | The rapid spread of information in the digital age highlights the critical
need for effective fact-checking tools, particularly for languages with limited
resources, such as Vietnamese. In response to this challenge, we introduce
ViFactCheck, the first publicly available benchmark dataset designed
specifically for Vietnamese fact-checking across multiple online news domains.
This dataset contains 7,232 human-annotated pairs of claim-evidence
combinations sourced from reputable Vietnamese online news, covering 12 diverse
topics. It has been subjected to a meticulous annotation process to ensure high
quality and reliability, achieving a Fleiss Kappa inter-annotator agreement
score of 0.83. Our evaluation leverages state-of-the-art pre-trained and large
language models, employing fine-tuning and prompting techniques to assess
performance. Notably, the Gemma model demonstrated superior effectiveness, with
an impressive macro F1 score of 89.90%, thereby establishing a new standard for
fact-checking benchmarks. This result highlights the robust capabilities of
Gemma in accurately identifying and verifying facts in Vietnamese. To further
promote advances in fact-checking technology and improve the reliability of
digital media, we have made the ViFactCheck dataset, model checkpoints,
fact-checking pipelines, and source code freely available on GitHub. This
initiative aims to inspire further research and enhance the accuracy of
information in low-resource languages. |
---|---|
DOI: | 10.48550/arxiv.2412.15308 |
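
The abstract reports two summary statistics: a Fleiss Kappa inter-annotator agreement score and a macro F1 score. As a rough illustration of how such figures are typically computed, the sketch below implements the standard textbook definitions of both in plain Python. It is not the authors' evaluation code. The function names `fleiss_kappa` and `macro_f1`, the label strings, and the toy data are all hypothetical and are not taken from the ViFactCheck dataset.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of annotators.

    `ratings` is a list of rows; each row holds the label every annotator
    assigned to one item (hypothetical input format for this sketch).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall label proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Degenerate case: only one label was ever used, so kappa is undefined.
        # This sketch returns 1.0 as an arbitrary choice.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data with hypothetical labels, three annotators per item.
    toy_ratings = [
        ["SUPPORTED", "SUPPORTED", "REFUTED"],
        ["REFUTED", "REFUTED", "REFUTED"],
    ]
    print(f"Fleiss kappa: {fleiss_kappa(toy_ratings):.2f}")

    toy_gold = ["SUPPORTED", "SUPPORTED", "REFUTED", "NEI"]
    toy_pred = ["SUPPORTED", "REFUTED", "REFUTED", "NEI"]
    print(f"Macro F1: {macro_f1(toy_gold, toy_pred):.4f}")
```

Macro averaging weights every class equally regardless of how many examples it has, which is why it is often preferred over accuracy when label counts are imbalanced.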