Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling
Social media has effectively become the prime hub of communication and digital marketing. As these platforms enable the free manifestation of thoughts and facts in text, images and video, there is an extensive need to screen them to protect individuals and groups from offensive content targeted at t...
Gespeichert in:
Hauptverfasser: | , , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Social media has effectively become the prime hub of communication and
digital marketing. As these platforms enable the free manifestation of thoughts
and facts in text, images and video, there is an extensive need to screen them
to protect individuals and groups from offensive content targeted at them. Our
work intends to classify codemixed social media comments/posts in the Dravidian
languages of Tamil, Kannada, and Malayalam. We intend to improve offensive
language identification by generating pseudo-labels on the dataset. A custom
dataset is constructed by transliterating all the code-mixed texts into the
respective Dravidian language, either Kannada, Malayalam, or Tamil and then
generating pseudo-labels for the transliterated dataset. The two datasets are
combined using the generated pseudo-labels to create a custom dataset called
CMTRA. As Dravidian languages are under-resourced, our approach increases the
amount of training data for the language models. We fine-tune several recent
pretrained language models on the newly constructed dataset. We extract the
pretrained language embeddings and pass them onto recurrent neural networks. We
observe that fine-tuning ULMFiT on the custom dataset yields the best results
on the code-mixed test sets of all three languages. Our approach yields the
best results among the benchmarked models on Tamil-English, achieving a
weighted F1-Score of 0.7934 while scoring competitive weighted F1-Scores of
0.9624 and 0.7306 on the code-mixed test sets of Malayalam-English and
Kannada-English, respectively. |
---|---|
DOI: | 10.48550/arxiv.2108.12177 |