Bangla Corpus

Bibliographic details
Main authors: Akther, Aysha; Islam, Md. Shymon; Sultana, Hafsa; Rahman, A.K.Z. Rasel; Saha, Sujana; Alam, Kazi Masudul
Format: Dataset
Language: English
Description
Summary: Bangla is one of the most widely spoken languages in the world, but Bangla NLP research is still at an early stage of development because of the lack of high-quality public corpora. In this article, we describe in detail the compilation methodology of KUMono, a comprehensive monolingual Bangla corpus. The corpus consists of more than 353 million word tokens in total and more than one million unique tokens, drawn from 18 major text categories of online Bangla websites.
DOI: 10.21227/3bhm-my48