WoLLaI Mal-Eng: Word Level Language Identification of Malayalam-English Code-Mixed Text
Main author: | |
---|---|
Format: | Dataset |
Language: | eng |
Subjects: | |
Summary: | WoLLaI Mal-Eng is a curated, annotated dataset for word-level language identification in Malayalam-English code-mixed text. It contains 12,402 tokenized sentences. The dataset file has three columns: sentence#, words, and language. Each word carries one of four language labels: Mal, Eng, Mix, or Othr. Mal marks Malayalam words that Malayalam speakers recognize. Eng marks English words that English speakers readily recognize. Mix marks words that combine the two languages, where a Malayalam suffix is attached to an English word, or to part of one, to aid comprehension for Malayalam speakers. Othr covers other elements such as numbers, abbreviations, and named entities. |
---|---|
DOI: | 10.17632/tzrcrrwz4n.1 |
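
The three-column layout and four-label scheme described in the summary could be read roughly as follows. This is a minimal sketch, not the dataset's official loader: it assumes the file is comma-separated, that the header names are exactly `sentence#`, `words`, and `language`, and that the sentence number is repeated on every row. None of that is confirmed by the record. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter, defaultdict

# The four annotation classes named in the summary.
VALID_LABELS = {"Mal", "Eng", "Mix", "Othr"}

# Invented rows for illustration only (NOT from the dataset).
# Assumed header names: sentence#, words, language.
SAMPLE_CSV = """sentence#,words,language
1,njan,Mal
1,officil,Mix
1,poyi,Mal
2,meeting,Eng
2,2024,Othr
"""


def load_sentences(handle):
    """Group (word, label) pairs by sentence id, preserving token order."""
    sentences = defaultdict(list)
    for row in csv.DictReader(handle):
        label = row["language"].strip()
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected language label: {label!r}")
        sentence_id = row["sentence#"].strip()
        sentences[sentence_id].append((row["words"].strip(), label))
    return dict(sentences)


def label_counts(sentences):
    """Count how many tokens carry each language label."""
    return Counter(
        label for tokens in sentences.values() for _, label in tokens
    )


sentences = load_sentences(io.StringIO(SAMPLE_CSV))
counts = label_counts(sentences)
print(len(sentences), "sentences")
print(dict(counts))
```

To try this on the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the delimiter or header names if they differ from the assumptions above.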