Antisemitic Messages? A Guide to High-Quality Annotation and a Labeled Dataset of Tweets

One of the major challenges in automatic hate speech detection is the lack of datasets that cover a wide range of biased and unbiased messages and that are consistently labeled. We propose a labeling procedure that addresses some of the common weaknesses of labeled datasets. We focus on antisemitic...

Bibliographic Details
Main authors:	Jikeli, Gunther; Karali, Sameer; Miehling, Daniel; Soemer, Katharina
Format:	Article
Language:	eng
Subjects:	Computer Science - Computation and Language; Computer Science - Computers and Society
Online access:	Order full text

creator	Jikeli, Gunther; Karali, Sameer; Miehling, Daniel; Soemer, Katharina
description	One of the major challenges in automatic hate speech detection is the lack of datasets that cover a wide range of biased and unbiased messages and that are consistently labeled. We propose a labeling procedure that addresses some of the common weaknesses of labeled datasets. We focus on antisemitic speech on Twitter and create a labeled dataset of 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021 by drawing from representative samples with relevant keywords. Our annotation process aims to strictly apply a commonly used definition of antisemitism by forcing annotators to specify which part of the definition applies, and by giving them the option to personally disagree with the definition on a case-by-case basis. Labeling tweets that call out antisemitism, report antisemitism, or are otherwise related to antisemitism (such as the Holocaust) but are not actually antisemitic can help reduce false positives in automated detection. The dataset includes 1,250 tweets (18%) that are antisemitic according to the International Holocaust Remembrance Alliance (IHRA) definition of antisemitism. It is important to note, however, that the dataset is not comprehensive. Many topics are still not covered, and it only includes tweets collected from Twitter between January 2019 and December 2021. Additionally, the dataset only includes tweets that were written in English. Despite these limitations, we hope that this is a meaningful contribution to improving the automated detection of antisemitic speech.
doi_str_mv	10.48550/arxiv.2304.14599
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2304_14599</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2304_14599</sourcerecordid><originalsourceid>FETCH-LOGICAL-a679-b3ad2fb971152f8182f643a5f4b636aebac309d721b13f447b65d0d10e2fc3</originalsourceid><addsrcrecordid>eNotz7tOwzAUgGEvDKjwAEycF0jwNZcJRQVapCBUqQNbdBwfF0tpgmIX6NtXLUz_9ksfY3eC57oyhj_g_Bu-c6m4zoU2dX3NPpoxhUj7kEIPbxQj7ig-QgOrQ3AEaYJ12H1mmwMOIR2hGccpYQrTCDg6QGjR0kAOnjBhpASTh-0PUYo37MrjEOn2vwu2eXneLtdZ-756XTZthkVZZ1ahk97WpRBG-kpU0hdaofHaFqpAstgrXrtSCiuU17q0hXHcCU7S92rB7v-eF1f3NYc9zsfu7OsuPnUCcrBKwg</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Antisemitic Messages? A Guide to High-Quality Annotation and a Labeled Dataset of Tweets</title><source>arXiv.org</source><creator>Jikeli, Gunther ; Karali, Sameer ; Miehling, Daniel ; Soemer, Katharina</creator><creatorcontrib>Jikeli, Gunther ; Karali, Sameer ; Miehling, Daniel ; Soemer, Katharina</creatorcontrib><description>One of the major challenges in automatic hate speech detection is the lack of datasets that cover a wide range of biased and unbiased messages and that are consistently labeled. We propose a labeling procedure that addresses some of the common weaknesses of labeled datasets. We focus on antisemitic speech on Twitter and create a labeled dataset of 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021 by drawing from representative samples with relevant keywords. Our annotation process aims to strictly apply a commonly used definition of antisemitism by forcing annotators to specify which part of the definition applies, and by giving them the option to personally disagree with the definition on a case-by-case basis. 
Labeling tweets that call out antisemitism, report antisemitism, or are otherwise related to antisemitism (such as the Holocaust) but are not actually antisemitic can help reduce false positives in automated detection. The dataset includes 1,250 tweets (18%) that are antisemitic according to the International Holocaust Remembrance Alliance (IHRA) definition of antisemitism. It is important to note, however, that the dataset is not comprehensive. Many topics are still not covered, and it only includes tweets collected from Twitter between January 2019 and December 2021. Additionally, the dataset only includes tweets that were written in English. Despite these limitations, we hope that this is a meaningful contribution to improving the automated detection of antisemitic speech.</description><identifier>DOI: 10.48550/arxiv.2304.14599</identifier><language>eng</language><subject>Computer Science - Computation and Language ; Computer Science - Computers and Society</subject><creationdate>2023-04</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2304.14599$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2304.14599$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Jikeli, Gunther</creatorcontrib><creatorcontrib>Karali, Sameer</creatorcontrib><creatorcontrib>Miehling, Daniel</creatorcontrib><creatorcontrib>Soemer, Katharina</creatorcontrib><title>Antisemitic Messages? 
A Guide to High-Quality Annotation and a Labeled Dataset of Tweets</title><description>One of the major challenges in automatic hate speech detection is the lack of datasets that cover a wide range of biased and unbiased messages and that are consistently labeled. We propose a labeling procedure that addresses some of the common weaknesses of labeled datasets. We focus on antisemitic speech on Twitter and create a labeled dataset of 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021 by drawing from representative samples with relevant keywords. Our annotation process aims to strictly apply a commonly used definition of antisemitism by forcing annotators to specify which part of the definition applies, and by giving them the option to personally disagree with the definition on a case-by-case basis. Labeling tweets that call out antisemitism, report antisemitism, or are otherwise related to antisemitism (such as the Holocaust) but are not actually antisemitic can help reduce false positives in automated detection. The dataset includes 1,250 tweets (18%) that are antisemitic according to the International Holocaust Remembrance Alliance (IHRA) definition of antisemitism. It is important to note, however, that the dataset is not comprehensive. Many topics are still not covered, and it only includes tweets collected from Twitter between January 2019 and December 2021. Additionally, the dataset only includes tweets that were written in English. 
Despite these limitations, we hope that this is a meaningful contribution to improving the automated detection of antisemitic speech.</description><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Computers and Society</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz7tOwzAUgGEvDKjwAEycF0jwNZcJRQVapCBUqQNbdBwfF0tpgmIX6NtXLUz_9ksfY3eC57oyhj_g_Bu-c6m4zoU2dX3NPpoxhUj7kEIPbxQj7ig-QgOrQ3AEaYJ12H1mmwMOIR2hGccpYQrTCDg6QGjR0kAOnjBhpASTh-0PUYo37MrjEOn2vwu2eXneLtdZ-756XTZthkVZZ1ahk97WpRBG-kpU0hdaofHaFqpAstgrXrtSCiuU17q0hXHcCU7S92rB7v-eF1f3NYc9zsfu7OsuPnUCcrBKwg</recordid><startdate>20230427</startdate><enddate>20230427</enddate><creator>Jikeli, Gunther</creator><creator>Karali, Sameer</creator><creator>Miehling, Daniel</creator><creator>Soemer, Katharina</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230427</creationdate><title>Antisemitic Messages? 
A Guide to High-Quality Annotation and a Labeled Dataset of Tweets</title><author>Jikeli, Gunther ; Karali, Sameer ; Miehling, Daniel ; Soemer, Katharina</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a679-b3ad2fb971152f8182f643a5f4b636aebac309d721b13f447b65d0d10e2fc3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Computers and Society</topic><toplevel>online_resources</toplevel><creatorcontrib>Jikeli, Gunther</creatorcontrib><creatorcontrib>Karali, Sameer</creatorcontrib><creatorcontrib>Miehling, Daniel</creatorcontrib><creatorcontrib>Soemer, Katharina</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Jikeli, Gunther</au><au>Karali, Sameer</au><au>Miehling, Daniel</au><au>Soemer, Katharina</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Antisemitic Messages? A Guide to High-Quality Annotation and a Labeled Dataset of Tweets</atitle><date>2023-04-27</date><risdate>2023</risdate><abstract>One of the major challenges in automatic hate speech detection is the lack of datasets that cover a wide range of biased and unbiased messages and that are consistently labeled. We propose a labeling procedure that addresses some of the common weaknesses of labeled datasets. We focus on antisemitic speech on Twitter and create a labeled dataset of 6,941 tweets that cover a wide range of topics common in conversations about Jews, Israel, and antisemitism between January 2019 and December 2021 by drawing from representative samples with relevant keywords. 
Our annotation process aims to strictly apply a commonly used definition of antisemitism by forcing annotators to specify which part of the definition applies, and by giving them the option to personally disagree with the definition on a case-by-case basis. Labeling tweets that call out antisemitism, report antisemitism, or are otherwise related to antisemitism (such as the Holocaust) but are not actually antisemitic can help reduce false positives in automated detection. The dataset includes 1,250 tweets (18%) that are antisemitic according to the International Holocaust Remembrance Alliance (IHRA) definition of antisemitism. It is important to note, however, that the dataset is not comprehensive. Many topics are still not covered, and it only includes tweets collected from Twitter between January 2019 and December 2021. Additionally, the dataset only includes tweets that were written in English. Despite these limitations, we hope that this is a meaningful contribution to improving the automated detection of antisemitic speech.</abstract><doi>10.48550/arxiv.2304.14599</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2304.14599
language	eng
recordid	cdi_arxiv_primary_2304_14599
source	arXiv.org
subjects	Computer Science - Computation and Language; Computer Science - Computers and Society
title	Antisemitic Messages? A Guide to High-Quality Annotation and a Labeled Dataset of Tweets
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T11%3A57%3A30IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Antisemitic%20Messages?%20A%20Guide%20to%20High-Quality%20Annotation%20and%20a%20Labeled%20Dataset%20of%20Tweets&rft.au=Jikeli,%20Gunther&rft.date=2023-04-27&rft_id=info:doi/10.48550/arxiv.2304.14599&rft_dat=%3Carxiv_GOX%3E2304_14599%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true