SDRS: A new lossless dimensionality reduction for text corpora

• Need of migrating from token-based representations to synset-based ones to achieve better performance on spam filtering. • Review of current synset-based feature reduction schemes and representations. • Introducing SDRS feature reduction process based on the usage of NSGA-II algorithm and semantic taxon...

Detailed description

Bibliographic details
Published in:	Information processing & management 2020-07, Vol.57 (4), p.102249, Article 102249
Main authors:	de Mendizabal, Iñaki Velez; Basto-Fernandes, Vitor; Ezpeleta, Enaitz; Méndez, José R.; Zurutuza, Urko
Format:	Article
Language:	eng
Subjects:	Classifiers; Datasets; Evolutionary algorithms; Genetic algorithms; Information retrieval; Machine learning; Multi-objective evolutionary algorithms; Multiple objective analysis; Reduction; Representations; Semantic analysis; Semantic-based feature reduction; Semantics; Sildenafil; Sorting algorithms; Spam filtering; Spamming; Synset-based representation; Token-based representation
Online access:	Full text

container_end_page
container_issue	4
container_start_page	102249
container_title	Information processing & management
container_volume	57
creator	de Mendizabal, Iñaki Velez; Basto-Fernandes, Vitor; Ezpeleta, Enaitz; Méndez, José R.; Zurutuza, Urko
description	• Need to migrate from token-based representations to synset-based ones to achieve better performance on spam filtering. • Review of current synset-based feature reduction schemes and representations. • Introduction of the SDRS feature reduction process, based on the NSGA-II algorithm and semantic taxonomic relations between tokens. • Design and execution of an experimental protocol to test the suitability of the SDRS dimensionality reduction method. In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After introducing multiple performance enhancements, the impact has been virtually irrelevant. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different ways to take advantage of semantic information to address problems such as dimensionality reduction. These preliminary solutions present some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset-features can be semantically grouped by taking advantage of taxonomic relations (mainly hypernyms) provided by the BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single features “anti-impotence drug”, “drug” or “chemical substance” depending on the generalization of 1, 2 or 3 levels). In order to decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA) and, in particular, of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We have compared the performance achieved by a Naïve Bayes classifier, using both token-based and synset-based dataset representations, with and without executing dimensional reductions. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to a better performance of Naïve Bayes classifiers.
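The hypernym-based grouping described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `HYPERNYM` table is a hand-written toy taxonomy built from the abstract's own "Viagra"/"Cialis" example (it is not BabelNet data), and `generalize` and `reduce_features` are hypothetical helper names, not the authors' implementation. The sketch only shows how a per-synset number of generalization levels merges features while keeping the total occurrence count; choosing those levels is the part the abstract assigns to NSGA-II and is not shown here.

```python
from collections import Counter

# Hypothetical toy taxonomy: synset -> direct hypernym (not real BabelNet data).
HYPERNYM = {
    "viagra": "anti-impotence drug",
    "cialis": "anti-impotence drug",
    "anti-impotence drug": "drug",
    "drug": "chemical substance",
}


def generalize(synset, levels):
    """Climb up to `levels` hypernym steps, stopping early at the taxonomy root."""
    for _ in range(levels):
        if synset not in HYPERNYM:
            break
        synset = HYPERNYM[synset]
    return synset


def reduce_features(counts, levels_per_synset):
    """Merge feature counts under their generalized synsets.

    Counts are summed rather than discarded, so the total number of
    occurrences is preserved. Synsets without an entry in
    `levels_per_synset` are left at level 0 (unchanged).
    """
    reduced = Counter()
    for synset, n in counts.items():
        reduced[generalize(synset, levels_per_synset.get(synset, 0))] += n
    return reduced


# One document represented as synset counts (illustrative values).
doc = Counter({"viagra": 2, "cialis": 1, "meeting": 3})
# Candidate solution: generalize both drug names by one level.
levels = {"viagra": 1, "cialis": 1}
reduced = reduce_features(doc, levels)
print(dict(reduced))  # {'anti-impotence drug': 3, 'meeting': 3}
```

In this toy setting, three features collapse into two and no occurrence is dropped. A multi-objective search could treat the `levels` mapping as the candidate solution and evaluate it on criteria such as feature count and classifier performance; that framing is an assumption drawn from the abstract, not a description of the published algorithm.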
doi_str_mv	10.1016/j.ipm.2020.102249
format	Article
els_id	S0306457319314694
publisher	Oxford: Elsevier Ltd
rights	2020 Elsevier Ltd; Copyright Pergamon Press Inc. Jul 2020
oa	free_for_read
toplevel	peer_reviewed; online_resources
collection	CrossRef; Library & Information Sciences Abstracts (LISA); Library & Information Science Abstracts (LISA)
creationdate	20200701
fulltext	fulltext
identifier	ISSN: 0306-4573
ispartof	Information processing & management, 2020-07, Vol.57 (4), p.102249, Article 102249
issn	0306-4573 1873-5371
language	eng
recordid	cdi_proquest_journals_2438721171
source	ScienceDirect Journals (5 years ago - present)
subjects	Classifiers; Datasets; Evolutionary algorithms; Genetic algorithms; Information retrieval; Machine learning; Multi-objective evolutionary algorithms; Multiple objective analysis; Reduction; Representations; Semantic analysis; Semantic-based feature reduction; Semantics; Sildenafil; Sorting algorithms; Spam filtering; Spamming; Synset-based representation; Token-based representation
title	SDRS: A new lossless dimensionality reduction for text corpora
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T02%3A08%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SDRS:%20A%20new%20lossless%20dimensionality%20reduction%20for%20text%20corpora&rft.jtitle=Information%20processing%20&%20management&rft.au=de%20Mendizabal,%20I%C3%B1aki%20Velez&rft.date=2020-07-01&rft.volume=57&rft.issue=4&rft.spage=102249&rft.pages=102249-&rft.artnum=102249&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2020.102249&rft_dat=%3Cproquest_cross%3E2438721171%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2438721171&rft_id=info:pmid/&rft_els_id=S0306457319314694&rfr_iscdi=true