BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | The proliferation of transliterated texts in digital spaces has emphasized
the need for detecting and classifying hate speech in languages beyond English,
particularly in low-resource languages. As online discourse can perpetuate
discrimination based on target groups, e.g. gender, religion, and origin,
multi-label classification of hateful content can help in comprehending hate
motivation and enhance content moderation. While previous efforts have focused
on monolingual or binary hate classification tasks, no work has yet addressed
the challenge of multi-label hate speech classification in transliterated
Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate
speech dataset comprising 37.3k samples. The samples are sourced from YouTube
comments, where each instance is labeled with one or more target groups,
reflecting the regional demographic. We establish novel transformer
encoder-based baselines by further pre-training on a transliterated Bangla
corpus. We also propose a novel translation-based LLM prompting strategy for
transliterated text. Experiments reveal that our further pre-trained encoders
achieve state-of-the-art performance on the BanTH dataset, while our
translation-based prompting outperforms other strategies in the zero-shot
setting. The introduction of BanTH not only fills a critical gap in hate speech
research for Bangla but also sets the stage for future exploration into
code-mixed and multi-label classification challenges in underrepresented
languages. |
---|---|
DOI: | 10.48550/arxiv.2410.13281 |