HOCON34k: A Corpus of Hate speech in Online Comments from German Newspapers

Bibliographic details
Main authors: Keller, Max-Emanuel, Auch, Maximilian, Döschl, Alexander, Vlk, Fabian, Quernheim, Julian, Hartmann, Mike, Mandl, Peter, Kaul, Alexander, Franz, Markus
Format: Dataset
Language: German
Description
Summary: We have compiled a dataset of 34,223 German-language comments written by users of online platforms associated with public discourse in German newspapers. Each comment was annotated for hate speech and for the adequacy of contextual information by a group of 29 volunteers, using a binary annotation approach. The inter-rater reliability for hate speech, measured by Fleiss’ Kappa, is 0.4428 across all annotators and rises to 0.6078 for an optimized subset of 12 annotators. Additionally, we present a baseline text classification using BERT, achieving an MCC score of up to 0.32 and an F2 score of up to 0.64 in our initial experiment on this new corpus. The dataset, named HOCON34k, comprises German hate speech comments from newspapers and is publicly available for research purposes.
DOI: 10.5281/zenodo.12665947
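
The summary reports three metrics: Fleiss' Kappa for annotator agreement, and MCC and F2 for the BERT baseline. The following is a minimal sketch of their textbook definitions in plain Python, not the authors' evaluation code. The function names and the toy rating counts are hypothetical and are not taken from HOCON34k. The kappa function also assumes that every comment received the same number of ratings, which the record does not state about the real corpus.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' Kappa for a table of rating counts.

    counts[i][j] = number of annotators who assigned comment i to category j.
    Assumption: every row sums to the same number of ratings n (n >= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement per comment, averaged over all comments.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance, from the overall category proportions.
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    return (p_bar - p_e) / (1 - p_e)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when a marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Invented toy data: 4 comments, 3 annotators, columns = [not hate, hate].
toy_counts = [[2, 1], [3, 0], [0, 3], [1, 2]]
toy_kappa = fleiss_kappa(toy_counts)
```

With beta set to 2, a missed hate speech comment (false negative) costs more than a false alarm, which is one plausible reason to report F2 rather than F1 for this task.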