Developing an online hate classifier for multiple social media platforms

The proliferation of social media enables people to express their opinions widely online. However, at the same time, this has resulted in the emergence of conflict and hate, making online environments uninviting for users. Although researchers have found that hate is a problem across multiple platfo...

Detailed description

Bibliographic details
Published in: Human-centric computing and information sciences 2020-01, Vol.10 (1), p.1-34, Article 1
Main authors: Salminen, Joni; Hopf, Maximilian; Chowdhury, Shammur A.; Jung, Soon-gyo; Almerekhi, Hind; Jansen, Bernard J.
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 34
container_issue 1
container_start_page 1
container_title Human-centric computing and information sciences
container_volume 10
creator Salminen, Joni
Hopf, Maximilian
Chowdhury, Shammur A.
Jung, Soon-gyo
Almerekhi, Hind
Jansen, Bernard J.
description The proliferation of social media enables people to express their opinions widely online. However, at the same time, this has resulted in the emergence of conflict and hate, making online environments uninviting for users. Although researchers have found that hate is a problem across multiple platforms, there is a lack of models for online hate detection using multi-platform data. To address this research gap, we collect a total of 197,566 comments from four platforms: YouTube, Reddit, Wikipedia, and Twitter, with 80% of the comments labeled as non-hateful and the remaining 20% labeled as hateful. We then experiment with several classification algorithms (Logistic Regression, Naïve Bayes, Support Vector Machines, XGBoost, and Neural Networks) and feature representations (Bag-of-Words, TF-IDF, Word2Vec, BERT, and their combination). While all the models significantly outperform the keyword-based baseline classifier, XGBoost using all features performs the best (F1 = 0.92). Feature importance analysis indicates that BERT features are the most impactful for the predictions. Findings support the generalizability of the best model, as the platform-specific results from Twitter and Wikipedia are comparable to their respective source papers. We make our code publicly available for application in real software systems as well as for further development by online hate researchers.
doi_str_mv 10.1186/s13673-019-0205-6
format Article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2331713340</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2331713340</sourcerecordid><originalsourceid>FETCH-LOGICAL-c359t-562ee9a605c3cc657b10cdfd53297ca501c1f6d3b1fee45831e44481a9847c9c3</originalsourceid><addsrcrecordid>eNp1kE1LAzEQhoMoWLQ_wFvA82om2WQ3R6kfFQpe9BzS7KSmZD9MtoL_3i0r6MXTDMzzvgMPIVfAbgBqdZtBqEoUDHTBOJOFOiELDpoXoBU__bOfk2XOe8YYsIrLSizI-h4_MfZD6HbUdrTvYuiQvtsRqYs25-ADJur7RNtDHMMQkebeBRtpi02wdIh2nK5tviRn3saMy595Qd4eH15X62Lz8vS8utsUTkg9FlJxRG0Vk044p2S1BeYa30jBdeWsZODAq0ZswSOWshaAZVnWYHVdVk47cUGu594h9R8HzKPZ94fUTS8NFwIqEKJkEwUz5VKfc0JvhhRam74MMHN0ZmZnZnJmjs6MmjJ8zuSJ7XaYfpv_D30D_E1uRQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2331713340</pqid></control><display><type>article</type><title>Developing an online hate classifier for multiple social media platforms</title><source>SpringerNature Journals</source><source>Springer Nature OA Free Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><creator>Salminen, Joni ; Hopf, Maximilian ; Chowdhury, Shammur A. ; Jung, Soon-gyo ; Almerekhi, Hind ; Jansen, Bernard J.</creator><creatorcontrib>Salminen, Joni ; Hopf, Maximilian ; Chowdhury, Shammur A. ; Jung, Soon-gyo ; Almerekhi, Hind ; Jansen, Bernard J.</creatorcontrib><description>The proliferation of social media enables people to express their opinions widely online. However, at the same time, this has resulted in the emergence of conflict and hate, making online environments uninviting for users. Although researchers have found that hate is a problem across multiple platforms, there is a lack of models for online hate detection using multi-platform data. 
To address this research gap, we collect a total of 197,566 comments from four platforms: YouTube, Reddit, Wikipedia, and Twitter, with 80% of the comments labeled as non-hateful and the remaining 20% labeled as hateful. We then experiment with several classification algorithms (Logistic Regression, Naïve Bayes, Support Vector Machines, XGBoost, and Neural Networks) and feature representations (Bag-of-Words, TF-IDF, Word2Vec, BERT, and their combination). While all the models significantly outperform the keyword-based baseline classifier, XGBoost using all features performs the best (F1 = 0.92). Feature importance analysis indicates that BERT features are the most impactful for the predictions. Findings support the generalizability of the best model, as the platform-specific results from Twitter and Wikipedia are comparable to their respective source papers. We make our code publicly available for application in real software systems as well as for further development by online hate researchers.</description><identifier>ISSN: 2192-1962</identifier><identifier>EISSN: 2192-1962</identifier><identifier>DOI: 10.1186/s13673-019-0205-6</identifier><language>eng</language><publisher>Berlin/Heidelberg: Springer Berlin Heidelberg</publisher><subject>Algorithms ; Artificial Intelligence ; Classifiers ; Communications Engineering ; Computer Science ; Computer Systems Organization and Communication Networks ; Digital media ; Encyclopedias ; Information Systems and Communication Service ; Information Systems Applications (incl.Internet) ; Networks ; Neural networks ; Regression analysis ; Researchers ; Social networks ; Support vector machines ; User Interfaces and Human Computer Interaction</subject><ispartof>Human-centric computing and information sciences, 2020-01, Vol.10 (1), p.1-34, Article 1</ispartof><rights>The Author(s) 2020</rights><rights>Human-centric Computing and Information Sciences is a copyright of Springer, (2020). All Rights Reserved. © 2020. 
This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c359t-562ee9a605c3cc657b10cdfd53297ca501c1f6d3b1fee45831e44481a9847c9c3</citedby><cites>FETCH-LOGICAL-c359t-562ee9a605c3cc657b10cdfd53297ca501c1f6d3b1fee45831e44481a9847c9c3</cites><orcidid>0000-0003-3230-0561</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1186/s13673-019-0205-6$$EPDF$$P50$$Gspringer$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://doi.org/10.1186/s13673-019-0205-6$$EHTML$$P50$$Gspringer$$Hfree_for_read</linktohtml><link.rule.ids>314,780,784,27924,27925,41120,41488,42189,42557,51319,51576</link.rule.ids></links><search><creatorcontrib>Salminen, Joni</creatorcontrib><creatorcontrib>Hopf, Maximilian</creatorcontrib><creatorcontrib>Chowdhury, Shammur A.</creatorcontrib><creatorcontrib>Jung, Soon-gyo</creatorcontrib><creatorcontrib>Almerekhi, Hind</creatorcontrib><creatorcontrib>Jansen, Bernard J.</creatorcontrib><title>Developing an online hate classifier for multiple social media platforms</title><title>Human-centric computing and information sciences</title><addtitle>Hum. Cent. Comput. Inf. Sci</addtitle><description>The proliferation of social media enables people to express their opinions widely online. However, at the same time, this has resulted in the emergence of conflict and hate, making online environments uninviting for users. Although researchers have found that hate is a problem across multiple platforms, there is a lack of models for online hate detection using multi-platform data. 
To address this research gap, we collect a total of 197,566 comments from four platforms: YouTube, Reddit, Wikipedia, and Twitter, with 80% of the comments labeled as non-hateful and the remaining 20% labeled as hateful. We then experiment with several classification algorithms (Logistic Regression, Naïve Bayes, Support Vector Machines, XGBoost, and Neural Networks) and feature representations (Bag-of-Words, TF-IDF, Word2Vec, BERT, and their combination). While all the models significantly outperform the keyword-based baseline classifier, XGBoost using all features performs the best (F1 = 0.92). Feature importance analysis indicates that BERT features are the most impactful for the predictions. Findings support the generalizability of the best model, as the platform-specific results from Twitter and Wikipedia are comparable to their respective source papers. We make our code publicly available for application in real software systems as well as for further development by online hate researchers.</description><subject>Algorithms</subject><subject>Artificial Intelligence</subject><subject>Classifiers</subject><subject>Communications Engineering</subject><subject>Computer Science</subject><subject>Computer Systems Organization and Communication Networks</subject><subject>Digital media</subject><subject>Encyclopedias</subject><subject>Information Systems and Communication Service</subject><subject>Information Systems Applications (incl.Internet)</subject><subject>Networks</subject><subject>Neural networks</subject><subject>Regression analysis</subject><subject>Researchers</subject><subject>Social networks</subject><subject>Support vector machines</subject><subject>User Interfaces and Human Computer 
Interaction</subject><issn>2192-1962</issn><issn>2192-1962</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>C6C</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp1kE1LAzEQhoMoWLQ_wFvA82om2WQ3R6kfFQpe9BzS7KSmZD9MtoL_3i0r6MXTDMzzvgMPIVfAbgBqdZtBqEoUDHTBOJOFOiELDpoXoBU__bOfk2XOe8YYsIrLSizI-h4_MfZD6HbUdrTvYuiQvtsRqYs25-ADJur7RNtDHMMQkebeBRtpi02wdIh2nK5tviRn3saMy595Qd4eH15X62Lz8vS8utsUTkg9FlJxRG0Vk044p2S1BeYa30jBdeWsZODAq0ZswSOWshaAZVnWYHVdVk47cUGu594h9R8HzKPZ94fUTS8NFwIqEKJkEwUz5VKfc0JvhhRam74MMHN0ZmZnZnJmjs6MmjJ8zuSJ7XaYfpv_D30D_E1uRQ</recordid><startdate>20200102</startdate><enddate>20200102</enddate><creator>Salminen, Joni</creator><creator>Hopf, Maximilian</creator><creator>Chowdhury, Shammur A.</creator><creator>Jung, Soon-gyo</creator><creator>Almerekhi, Hind</creator><creator>Jansen, Bernard J.</creator><general>Springer Berlin Heidelberg</general><general>Korea Information Processing Society, Computer Software Research Group</general><scope>C6C</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7XB</scope><scope>8AL</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>M0N</scope><scope>P5Z</scope><scope>P62</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>Q9U</scope><orcidid>https://orcid.org/0000-0003-3230-0561</orcidid></search><sort><creationdate>20200102</creationdate><title>Developing an online hate classifier for multiple social media platforms</title><author>Salminen, Joni ; Hopf, Maximilian ; Chowdhury, 
Shammur A. ; Jung, Soon-gyo ; Almerekhi, Hind ; Jansen, Bernard J.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c359t-562ee9a605c3cc657b10cdfd53297ca501c1f6d3b1fee45831e44481a9847c9c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Artificial Intelligence</topic><topic>Classifiers</topic><topic>Communications Engineering</topic><topic>Computer Science</topic><topic>Computer Systems Organization and Communication Networks</topic><topic>Digital media</topic><topic>Encyclopedias</topic><topic>Information Systems and Communication Service</topic><topic>Information Systems Applications (incl.Internet)</topic><topic>Networks</topic><topic>Neural networks</topic><topic>Regression analysis</topic><topic>Researchers</topic><topic>Social networks</topic><topic>Support vector machines</topic><topic>User Interfaces and Human Computer Interaction</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Salminen, Joni</creatorcontrib><creatorcontrib>Hopf, Maximilian</creatorcontrib><creatorcontrib>Chowdhury, Shammur A.</creatorcontrib><creatorcontrib>Jung, Soon-gyo</creatorcontrib><creatorcontrib>Almerekhi, Hind</creatorcontrib><creatorcontrib>Jansen, Bernard J.</creatorcontrib><collection>Springer Nature OA Free Journals</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Computing Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace 
Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>Computing Database</collection><collection>Advanced Technologies &amp; Aerospace Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central Basic</collection><jtitle>Human-centric computing and information sciences</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Salminen, Joni</au><au>Hopf, Maximilian</au><au>Chowdhury, Shammur A.</au><au>Jung, Soon-gyo</au><au>Almerekhi, Hind</au><au>Jansen, Bernard J.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Developing an online hate classifier for multiple social media platforms</atitle><jtitle>Human-centric computing and information sciences</jtitle><stitle>Hum. Cent. Comput. Inf. Sci</stitle><date>2020-01-02</date><risdate>2020</risdate><volume>10</volume><issue>1</issue><spage>1</spage><epage>34</epage><pages>1-34</pages><artnum>1</artnum><issn>2192-1962</issn><eissn>2192-1962</eissn><abstract>The proliferation of social media enables people to express their opinions widely online. However, at the same time, this has resulted in the emergence of conflict and hate, making online environments uninviting for users. 
Although researchers have found that hate is a problem across multiple platforms, there is a lack of models for online hate detection using multi-platform data. To address this research gap, we collect a total of 197,566 comments from four platforms: YouTube, Reddit, Wikipedia, and Twitter, with 80% of the comments labeled as non-hateful and the remaining 20% labeled as hateful. We then experiment with several classification algorithms (Logistic Regression, Naïve Bayes, Support Vector Machines, XGBoost, and Neural Networks) and feature representations (Bag-of-Words, TF-IDF, Word2Vec, BERT, and their combination). While all the models significantly outperform the keyword-based baseline classifier, XGBoost using all features performs the best (F1 = 0.92). Feature importance analysis indicates that BERT features are the most impactful for the predictions. Findings support the generalizability of the best model, as the platform-specific results from Twitter and Wikipedia are comparable to their respective source papers. We make our code publicly available for application in real software systems as well as for further development by online hate researchers.</abstract><cop>Berlin/Heidelberg</cop><pub>Springer Berlin Heidelberg</pub><doi>10.1186/s13673-019-0205-6</doi><tpages>34</tpages><orcidid>https://orcid.org/0000-0003-3230-0561</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2192-1962
ispartof Human-centric computing and information sciences, 2020-01, Vol.10 (1), p.1-34, Article 1
issn 2192-1962
2192-1962
language eng
recordid cdi_proquest_journals_2331713340
source SpringerNature Journals; Springer Nature OA Free Journals; EZB-FREE-00999 freely available EZB journals
subjects Algorithms
Artificial Intelligence
Classifiers
Communications Engineering
Computer Science
Computer Systems Organization and Communication Networks
Digital media
Encyclopedias
Information Systems and Communication Service
Information Systems Applications (incl.Internet)
Networks
Neural networks
Regression analysis
Researchers
Social networks
Support vector machines
User Interfaces and Human Computer Interaction
title Developing an online hate classifier for multiple social media platforms
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T21%3A23%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Developing%20an%20online%20hate%20classifier%20for%20multiple%20social%20media%20platforms&rft.jtitle=Human-centric%20computing%20and%20information%20sciences&rft.au=Salminen,%20Joni&rft.date=2020-01-02&rft.volume=10&rft.issue=1&rft.spage=1&rft.epage=34&rft.pages=1-34&rft.artnum=1&rft.issn=2192-1962&rft.eissn=2192-1962&rft_id=info:doi/10.1186/s13673-019-0205-6&rft_dat=%3Cproquest_cross%3E2331713340%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2331713340&rft_id=info:pmid/&rfr_iscdi=true
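
The description field in this record mentions a keyword-based baseline classifier and reports results as F1 scores. The following is a minimal sketch of what such a baseline and its evaluation could look like. It is not the authors' code: the keyword list, the toy comments, the labels, and the function names `keyword_baseline` and `f1_score` are all hypothetical and chosen only for illustration.

```python
from typing import List, Set

# Hypothetical keyword list for illustration only (not taken from the paper).
KEYWORDS: Set[str] = {"idiot", "stupid", "hate"}


def keyword_baseline(comment: str, keywords: Set[str] = KEYWORDS) -> int:
    """Return 1 (hateful) if any keyword occurs as a token, else 0."""
    # Lowercase each whitespace-separated token and strip common punctuation.
    tokens = {tok.strip(".,!?;:\"'()").lower() for tok in comment.split()}
    return int(bool(tokens & keywords))


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive (hateful) class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # No true positives: precision or recall is zero or undefined.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for this sketch (1 = hateful, 0 = non-hateful).
COMMENTS = [
    "I hate you, you idiot!",
    "Great video, thanks for sharing.",
    "People like you should disappear.",
    "I hate waiting for the bus.",
    "What a lovely day.",
]
LABELS = [1, 0, 1, 0, 0]

PREDICTIONS = [keyword_baseline(c) for c in COMMENTS]

if __name__ == "__main__":
    print("predictions:", PREDICTIONS)
    print("F1:", f1_score(LABELS, PREDICTIONS))
```

The toy data is built to show the two typical failure modes of keyword matching: a hostile comment with no listed keyword is missed, and a harmless use of a listed keyword is flagged. The record's description states that all of the learned models outperformed the keyword-based baseline.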