How Good is Your Wikipedia?
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Abstract: | Wikipedia's perceived high quality and broad language coverage have
established it as a fundamental resource in multilingual NLP. In the context of
low-resource languages, however, these quality assumptions are increasingly
being scrutinised. This paper critically examines the data quality of Wikipedia
in a non-English setting by subjecting it to various quality filtering
techniques, revealing widespread issues such as a high percentage of one-line
articles and duplicate articles. We evaluate the downstream impact of quality
filtering on Wikipedia and find that data quality pruning is an effective means
for resource-efficient training without hurting performance, especially for
low-resource languages. Moreover, we advocate for a shift in perspective from
seeking a general definition of data quality towards a more language- and
task-specific one. Ultimately, we aim for this study to serve as a guide to
using Wikipedia for pretraining in a multilingual setting. |
---|---|
DOI: | 10.48550/arxiv.2411.05527 |
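
The abstract names two concrete issues, one-line articles and duplicate articles. This record does not describe the filters the paper actually applies, so the following is only a minimal sketch of what such pruning could look like. The function names `is_one_line` and `prune`, the whitespace and case normalisation, and the use of exact hash matching are all assumptions made for illustration, not the authors' method.

```python
import hashlib


def is_one_line(text: str) -> bool:
    """Return True if the article has at most one non-empty line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines) <= 1


def prune(articles: dict[str, str]) -> dict[str, str]:
    """Drop one-line articles and exact duplicates.

    Duplicates are detected after lowercasing and collapsing whitespace
    (an assumed normalisation); the first occurrence is kept.
    """
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for title, text in articles.items():
        if is_one_line(text):
            continue
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept[title] = text
    return kept
```

Exact hashing only catches articles that are identical after normalisation. Near-duplicates, such as templated stubs that differ by a place name, would need a fuzzy method, and the thresholds would likely have to be chosen per language, in line with the abstract's call for language- and task-specific notions of quality.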