OMCD: Offensive Moroccan Comments Dataset

Bibliographic Details
Published in:	Language resources and evaluation 2023-12, Vol.57 (4), p.1745-1765
Main authors:	Essefar, Kabil; Ait Baha, Hassan; El Mahdaouy, Abdelkader; El Mekki, Abdellah; Berrada, Ismail
Format:	Article
Language:	English
Subjects:	Annotations; Arabic language; Computational Linguistics; Computer Science; Data collection; Datasets; Deep learning; Dialects; Hate speech; Language and Literature; Linguistics; Machine learning; Natural language processing; Original Paper; Social media; Social Sciences; State-of-the-art reviews; Statistical analysis
Online access:	Full text

Description
Abstract:	Offensive content, such as verbal attacks, demeaning comments, or hate speech, has become widespread on social media. Automatic detection of this content is considered an important and challenging task. Although several research works have been proposed to address this challenge for high-resource languages, research on detecting offensive content in Dialectal Arabic (DA) remains under-explored. Recently, the detection of offensive language in DA has gained increasing interest among researchers in Natural Language Processing (NLP). However, only a limited number of annotated datasets have been introduced for single or multiple coarse-grained dialects. In this paper, we introduce the Offensive Moroccan Comments Dataset (OMCD), the first dataset for offensive language detection in the Moroccan dialect. First, we present the data collection steps, the statistical analysis, and the annotation guidelines of the introduced dataset. Then, we evaluate several state-of-the-art Machine Learning (ML) and Deep Learning (DL) based models on the OMCD dataset. Finally, we highlight the impact of emojis on the evaluated models for offensive language detection.
ISSN:	1574-020X 1574-0218
DOI:	10.1007/s10579-023-09663-2