Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish

Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to...

Bibliographic details
Published in:	Cognitive computation 2022, Vol.14 (1), p.407-424
Main authors:	Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra
Format:	Article
Language:	eng
Subjects:	Algorithms; Artificial Intelligence; Automatic text analysis; Classification; Computation by Abstract Devices; Computational Biology/Bioinformatics; Computer Science; Data mining; Datasets; Digital media; Emotions; Error analysis; Machine learning; Product reviews; Sentiment analysis; Social networks; Software; Spanish language; Word sense disambiguation
Online access:	Full text

container_end_page	424
container_issue	1
container_start_page	407
container_title	Cognitive computation
container_volume	14
creator	Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura; Baldassarri, Sandra
description	Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time-consuming, and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus redundant classification is required to validate the assigned tag. Although the number of emotion-tagged text datasets in Spanish has been growing in recent years, they still do not compare with the English datasets in number, size, or quality. Quality is a particular concern, as few datasets in Spanish included a validation step in their construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss' kappa interrater agreement measure. Error analysis is performed using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundred thousand tagged comments and does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented falls in the moderate agreement zone, so the dataset is suitable for training classification algorithms in the sentiment analysis field.
doi_str_mv	10.1007/s12559-020-09800-x
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2919481957</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2919481957</sourcerecordid><originalsourceid>FETCH-LOGICAL-c319t-a2ebd7a356e18e70276028f74741d5500a81e52d5dc132ff54a6c09e143b95613</originalsourceid><addsrcrecordid>eNp9kM1OwzAQhC0EEqXwApwscQ7YTuzER9SWH6nAoeVsbROnpErsYDtVeXvcFsGN0-56Z2blD6FrSm4pIfmdp4xzmRBGEiILQpLdCRrRQohESpGd_vZcnKML7zeECC45G6EwbXwAE_Bi6LXbNl5XeGKND24oQ2MNBlPh2RbaAQ6jrTHgV7vVLZ5CAK_D_mnW2f02WcJ6HQMWtmygxS-6aiCmdZ02wePG4EUPpvEfl-ishtbrq586Ru8Ps-XkKZm_PT5P7udJmVIZEmB6VeWQcqFpoXPCckFYUedZntGKc0KgoJqzilclTVld8wxESaSmWbqKP6XpGN0cc3tnPwftg9rYwZl4UjFJZVZQyfOoYkdV6az3Tteqd00H7ktRovZ01ZGuinTVga7aRVN6NPkoNmvt_qL_cX0Dj_d92w</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2919481957</pqid></control><display><type>article</type><title>Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish</title><source>SpringerLink (Online service)</source><source>ProQuest Central</source><creator>Tessore, Juan Pablo ; Esnaola, Leonardo Martín ; Lanzarini, Laura ; Baldassarri, Sandra</creator><creatorcontrib>Tessore, Juan Pablo ; Esnaola, Leonardo Martín ; Lanzarini, Laura ; Baldassarri, Sandra</creatorcontrib><description>Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. 
Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. 
Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.</description><identifier>ISSN: 1866-9956</identifier><identifier>EISSN: 1866-9964</identifier><identifier>DOI: 10.1007/s12559-020-09800-x</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Algorithms ; Artificial Intelligence ; Automatic text analysis ; Classification ; Computation by Abstract Devices ; Computational Biology/Bioinformatics ; Computer Science ; Data mining ; Datasets ; Digital media ; Emotions ; Error analysis ; Machine learning ; Product reviews ; Sentiment analysis ; Social networks ; Software ; Spanish language ; Word sense disambiguation</subject><ispartof>Cognitive computation, 2022, Vol.14 (1), p.407-424</ispartof><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature 2021</rights><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature 2021.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c319t-a2ebd7a356e18e70276028f74741d5500a81e52d5dc132ff54a6c09e143b95613</citedby><cites>FETCH-LOGICAL-c319t-a2ebd7a356e18e70276028f74741d5500a81e52d5dc132ff54a6c09e143b95613</cites><orcidid>0000-0002-2111-0976 ; 0000-0002-9315-6391 ; 0000-0001-6298-9019 ; 
0000-0001-7027-7564</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s12559-020-09800-x$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2919481957?pq-origsite=primo$$EHTML$$P50$$Gproquest$$H</linktohtml><link.rule.ids>314,776,780,21367,27901,27902,33721,41464,42533,43781,51294</link.rule.ids></links><search><creatorcontrib>Tessore, Juan Pablo</creatorcontrib><creatorcontrib>Esnaola, Leonardo Martín</creatorcontrib><creatorcontrib>Lanzarini, Laura</creatorcontrib><creatorcontrib>Baldassarri, Sandra</creatorcontrib><title>Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish</title><title>Cognitive computation</title><addtitle>Cogn Comput</addtitle><description>Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. 
Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.</description><subject>Algorithms</subject><subject>Artificial Intelligence</subject><subject>Automatic text analysis</subject><subject>Classification</subject><subject>Computation by Abstract Devices</subject><subject>Computational Biology/Bioinformatics</subject><subject>Computer Science</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Digital media</subject><subject>Emotions</subject><subject>Error analysis</subject><subject>Machine learning</subject><subject>Product reviews</subject><subject>Sentiment analysis</subject><subject>Social networks</subject><subject>Software</subject><subject>Spanish language</subject><subject>Word sense 
disambiguation</subject><issn>1866-9956</issn><issn>1866-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNp9kM1OwzAQhC0EEqXwApwscQ7YTuzER9SWH6nAoeVsbROnpErsYDtVeXvcFsGN0-56Z2blD6FrSm4pIfmdp4xzmRBGEiILQpLdCRrRQohESpGd_vZcnKML7zeECC45G6EwbXwAE_Bi6LXbNl5XeGKND24oQ2MNBlPh2RbaAQ6jrTHgV7vVLZ5CAK_D_mnW2f02WcJ6HQMWtmygxS-6aiCmdZ02wePG4EUPpvEfl-ishtbrq586Ru8Ps-XkKZm_PT5P7udJmVIZEmB6VeWQcqFpoXPCckFYUedZntGKc0KgoJqzilclTVld8wxESaSmWbqKP6XpGN0cc3tnPwftg9rYwZl4UjFJZVZQyfOoYkdV6az3Tteqd00H7ktRovZ01ZGuinTVga7aRVN6NPkoNmvt_qL_cX0Dj_d92w</recordid><startdate>2022</startdate><enddate>2022</enddate><creator>Tessore, Juan Pablo</creator><creator>Esnaola, Leonardo Martín</creator><creator>Lanzarini, Laura</creator><creator>Baldassarri, Sandra</creator><general>Springer US</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>8FE</scope><scope>8FG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><orcidid>https://orcid.org/0000-0002-2111-0976</orcidid><orcidid>https://orcid.org/0000-0002-9315-6391</orcidid><orcidid>https://orcid.org/0000-0001-6298-9019</orcidid><orcidid>https://orcid.org/0000-0001-7027-7564</orcidid></search><sort><creationdate>2022</creationdate><title>Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish</title><author>Tessore, Juan Pablo ; Esnaola, Leonardo Martín ; Lanzarini, Laura ; Baldassarri, 
Sandra</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c319t-a2ebd7a356e18e70276028f74741d5500a81e52d5dc132ff54a6c09e143b95613</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Algorithms</topic><topic>Artificial Intelligence</topic><topic>Automatic text analysis</topic><topic>Classification</topic><topic>Computation by Abstract Devices</topic><topic>Computational Biology/Bioinformatics</topic><topic>Computer Science</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Digital media</topic><topic>Emotions</topic><topic>Error analysis</topic><topic>Machine learning</topic><topic>Product reviews</topic><topic>Sentiment analysis</topic><topic>Social networks</topic><topic>Software</topic><topic>Spanish language</topic><topic>Word sense disambiguation</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Tessore, Juan Pablo</creatorcontrib><creatorcontrib>Esnaola, Leonardo Martín</creatorcontrib><creatorcontrib>Lanzarini, Laura</creatorcontrib><creatorcontrib>Baldassarri, Sandra</creatorcontrib><collection>CrossRef</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies &amp; Aerospace Database (1962 - current)</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Advanced Technologies &amp; Aerospace Collection</collection><collection>ProQuest One 
Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><jtitle>Cognitive computation</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Tessore, Juan Pablo</au><au>Esnaola, Leonardo Martín</au><au>Lanzarini, Laura</au><au>Baldassarri, Sandra</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish</atitle><jtitle>Cognitive computation</jtitle><stitle>Cogn Comput</stitle><date>2022</date><risdate>2022</risdate><volume>14</volume><issue>1</issue><spage>407</spage><epage>424</epage><pages>407-424</pages><issn>1866-9956</issn><eissn>1866-9964</eissn><abstract>Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. 
Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s12559-020-09800-x</doi><tpages>18</tpages><orcidid>https://orcid.org/0000-0002-2111-0976</orcidid><orcidid>https://orcid.org/0000-0002-9315-6391</orcidid><orcidid>https://orcid.org/0000-0001-6298-9019</orcidid><orcidid>https://orcid.org/0000-0001-7027-7564</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1866-9956
ispartof	Cognitive computation, 2022, Vol.14 (1), p.407-424
issn	1866-9956 1866-9964
language	eng
recordid	cdi_proquest_journals_2919481957
source	SpringerLink (Online service); ProQuest Central
subjects	Algorithms; Artificial Intelligence; Automatic text analysis; Classification; Computation by Abstract Devices; Computational Biology/Bioinformatics; Computer Science; Data mining; Datasets; Digital media; Emotions; Error analysis; Machine learning; Product reviews; Sentiment analysis; Social networks; Software; Spanish language; Word sense disambiguation
title	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-11T18%3A07%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Distant%20Supervised%20Construction%20and%20Evaluation%20of%20a%20Novel%20Dataset%20of%20Emotion-Tagged%20Social%20Media%20Comments%20in%20Spanish&rft.jtitle=Cognitive%20computation&rft.au=Tessore,%20Juan%20Pablo&rft.date=2022&rft.volume=14&rft.issue=1&rft.spage=407&rft.epage=424&rft.pages=407-424&rft.issn=1866-9956&rft.eissn=1866-9964&rft_id=info:doi/10.1007/s12559-020-09800-x&rft_dat=%3Cproquest_cross%3E2919481957%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2919481957&rft_id=info:pmid/&rfr_iscdi=true
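
Note on the agreement measure: the description field above states that a sample of the dataset was validated with the Fleiss kappa interrater agreement measure. The block below is a minimal sketch of how that statistic is usually computed. It is not the authors' code; the function name `fleiss_kappa` and the example ratings are hypothetical and are not data from the article. The input is a table in which each row is one comment and each column counts how many raters chose that emotion category. On the commonly cited Landis and Koch scale, values from 0.41 to 0.60 are labeled "moderate"; this record does not say which scale the article applies.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    if not counts:
        raise ValueError("at least one rated item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        # All ratings fall in one category, so kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical ratings: 4 comments, 3 raters, 3 emotion categories.
    example = [
        [3, 0, 0],
        [2, 1, 0],
        [0, 3, 0],
        [0, 1, 2],
    ]
    print(f"kappa = {fleiss_kappa(example):.3f}")
```

The count-table layout is a design choice for the sketch: it needs no rater identities, which suits Fleiss' kappa because the statistic treats raters as interchangeable.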