OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Abstract: | Code-mixing is a well-studied linguistic phenomenon in which two or more
languages are mixed in text or speech. Several studies have built datasets and
performed downstream NLP tasks on code-mixed data. Although code-mixing of
three or more languages is not uncommon, most available datasets in this domain
contain code-mixed data from only two languages. In this paper, we introduce
OffMix-3L, a novel offensive language identification dataset containing
code-mixed data from three different languages. We experiment with several
models on this dataset and observe that BanglishBERT outperforms other
transformer-based models and GPT-3.5. |
---|---|
DOI: | 10.48550/arxiv.2310.18387 |