My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks
Format: Article
Language: English
Abstract: The research on code-mixed data is limited due to the unavailability of
dedicated code-mixed datasets and pre-trained language models. In this work, we
focus on the low-resource Indian language Marathi which lacks any prior work in
code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English
(Mr-En) corpus with 10 million social media sentences for pretraining. We also
release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models
pre-trained on MeCorpus. Furthermore, for benchmarking, we present three
supervised datasets MeHate, MeSent, and MeLID for downstream tasks like
code-mixed Mr-En hate speech detection, sentiment analysis, and language
identification, respectively. Each of these evaluation datasets consists of
approximately 12,000 manually annotated Marathi-English code-mixed tweets. Ablations
show that the models trained on this novel corpus significantly outperform the
existing state-of-the-art BERT models. This is the first work that presents
artifacts for code-mixed Marathi research. All datasets and models are publicly
released at https://github.com/l3cube-pune/MarathiNLP.
DOI: 10.48550/arxiv.2306.14030
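
The abstract states that the pretrained models are publicly released. Below is a minimal sketch of how one might load such a checkpoint with the Hugging Face transformers library; the model identifier "l3cube-pune/me-bert" is a hypothetical ID inferred from the project's naming, and the GitHub repository above lists the actual released artifacts.

    # Minimal sketch: load a code-mixed Marathi-English BERT checkpoint.
    # NOTE: the model ID below is an assumption; see
    # https://github.com/l3cube-pune/MarathiNLP for the released IDs.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    model_id = "l3cube-pune/me-bert"  # hypothetical Hub ID for L3Cube-MeBERT

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)

    # Tokenize a Roman-script code-mixed Mr-En sentence, as found in tweets.
    inputs = tokenizer("he movie khup chhan aahe, loved it!",
                       return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)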