Guide for the application of the data augmentation approach on sets of texts in Spanish for sentiment and emotion analysis

Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as...

Detailed description

Bibliographic details
Published in: PloS one 2024-09, Vol.19 (9), p.e0310707
Main authors: Gutiérrez Benítez, Rodrigo, Segura Navarrete, Alejandra, Vidal-Castro, Christian, Martínez-Araneda, Claudia
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue 9
container_start_page e0310707
container_title PloS one
container_volume 19
creator Gutiérrez Benítez, Rodrigo
Segura Navarrete, Alejandra
Vidal-Castro, Christian
Martínez-Araneda, Claudia
description Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
doi_str_mv 10.1371/journal.pone.0310707
format Article
fullrecord Record ID: TN_cdi_plos_journals_3110279204; Gale ID: A810211360; ProQuest ID: 3110279204; PMID: 39325750; ISSN: 1932-6203; EISSN: 1932-6203; DOI: 10.1371/journal.pone.0310707; publication date: 2024-09-26; publisher: United States: Public Library of Science; peer reviewed; open access (free to read); ORCID iDs: https://orcid.org/0000-0003-0949-7415, https://orcid.org/0000-0001-9188-7919, https://orcid.org/0000-0002-6881-5156; full text (HTML): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11426452/; full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11426452/pdf/; MEDLINE/PubMed record: https://www.ncbi.nlm.nih.gov/pubmed/39325750; rights: Copyright: © 2024 Gutiérrez Benítez et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
fulltext fulltext
identifier ISSN: 1932-6203
ispartof PloS one, 2024-09, Vol.19 (9), p.e0310707
issn 1932-6203
1932-6203
language eng
recordid cdi_plos_journals_3110279204
source Public Library of Science (PLoS) Journals Open Access; MEDLINE; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals; PubMed Central; Free Full-Text Journals in Chemistry
subjects Accuracy
Algorithms
Analysis
Artificial neural networks
Biology and Life Sciences
Classification
Communications industry
Computational linguistics
Computer and Information Sciences
Data analysis
Data augmentation
Data mining
Datasets
Deep Learning
Digital media
Education
Emotions
Evaluation
Generative adversarial networks
Humans
Language
Language processing
Machine Learning
Mass media
Methods
Natural language interfaces
Natural Language Processing
Neural networks
Neural Networks, Computer
Performance evaluation
Physical Sciences
Research and Analysis Methods
Semantics
Sentiment analysis
Social Media
Social networks
Social organization
Social Sciences
Spanish language
Support vector machines
Surveys
Taxonomy
Text processing
Texts
Translation
title Guide for the application of the data augmentation approach on sets of texts in Spanish for sentiment and emotion analysis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T14%3A46%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_plos_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Guide%20for%20the%20application%20of%20the%20data%20augmentation%20approach%20on%20sets%20of%20texts%20in%20Spanish%20for%20sentiment%20and%20emotion%20analysis&rft.jtitle=PloS%20one&rft.au=Guti%C3%A9rrez%20Ben%C3%ADtez,%20Rodrigo&rft.date=2024-09-26&rft.volume=19&rft.issue=9&rft.spage=e0310707&rft.pages=e0310707-&rft.issn=1932-6203&rft.eissn=1932-6203&rft_id=info:doi/10.1371/journal.pone.0310707&rft_dat=%3Cgale_plos_%3EA810211360%3C/gale_plos_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3110279204&rft_id=info:pmid/39325750&rft_galeid=A810211360&rfr_iscdi=true