naab: A ready-to-use plug-and-play corpus for Farsi

Bibliographic Details
Published in:	arXiv.org 2024-12
Main Authors:	Sabouri, Sadra; Rahmati, Elnaz; Gooran, Soroush; Sameti, Hossein
Format:	Article
Language:	eng
Subjects:	Computer Science - Computation and Language; Large language models; Natural language processing; Performance enhancement
Online Access:	Full text

Description
Summary:	The rise of large language models (LLMs) has transformed numerous natural language processing (NLP) tasks, yet their performance in low- and mid-resource languages, such as Farsi, still lags behind resource-rich languages like English. To address this gap, we introduce naab, the largest publicly available, cleaned, and ready-to-use Farsi textual corpus. naab consists of 130GB of data, comprising over 250 million paragraphs and 15 billion words. Named after the Farsi word NAAB (meaning "pure" or "high-grade"), this corpus is openly accessible via Hugging Face, offering researchers a valuable resource for Farsi NLP tasks. In addition to naab, we provide naab-raw, an unprocessed version of the dataset, along with a pre-processing toolkit that allows users to clean their custom corpora. These resources empower NLP researchers and practitioners, particularly those focusing on low-resource languages, to improve the performance of LLMs in their respective domains and bridge the gap between resource-rich and resource-poor languages.
ISSN:	2331-8422
DOI:	10.48550/arxiv.2208.13486