Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets


Bibliographic Details
Published in: Transactions of the Association for Computational Linguistics, 2022-01, Vol. 10, pp. 50-72
Main authors: Kreutzer, Julia, Caswell, Isaac, Wang, Lisa, Wahab, Ahsan, van Esch, Daan, Ulzii-Orshikh, Nasanbayar, Tapo, Allahsera, Subramani, Nishant, Sokolov, Artem, Sikasote, Claytone, Setyawan, Monang, Sarin, Supheakmungkol, Samb, Sokhar, Sagot, Benoît, Rivera, Clara, Rios, Annette, Papadimitriou, Isabel, Osei, Salomey, Suarez, Pedro Ortiz, Orife, Iroro, Ogueji, Kelechi, Rubungo, Andre Niyongabo, Nguyen, Toan Q., Müller, Mathias, Müller, André, Muhammad, Shamsuddeen Hassan, Muhammad, Nanda, Mnyakeni, Ayanda, Mirzakhalov, Jamshidbek, Matangira, Tapiwanashe, Leong, Colin, Lawson, Nze, Kudugunta, Sneha, Jernite, Yacine, Jenny, Mathias, Firat, Orhan, Dossou, Bonaventure F. P., Dlamini, Sakhile, de Silva, Nisansa, Çabuk Ballı, Sakine, Biderman, Stella, Battisti, Alessia, Baruwa, Ahmed, Bapna, Ankur, Baljekar, Pallavi, Azime, Israel Abebe, Awokoya, Ayodele, Ataman, Duygu, Ahia, Orevaoghene, Ahia, Oghenefego, Agrawal, Sweta, Adeyemi, Mofetoluwa
Format: Article
Language: English
Online access: Full text
Description
Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
ISSN: 2307-387X
DOI: 10.1162/tacl_a_00447