Applying effective feature selection techniques with hierarchical mixtures of experts for spam classification

E-mail abuse has been steadily increasing during the last decade. E-mail users find themselves targeted by massive quantities of unsolicited bulk e-mail, which often contains offensive language or has fraudulent intentions. Internet Service Providers (ISPs), on the other hand, have to face a consider...

Bibliographic details
Published in: Journal of computer security 2008-01, Vol.16 (6), p.761-790
Main authors: Belsis, Petros; Fragos, Kostas; Gritzalis, Stefanos; Skourlas, Christos
Format: Article
Language: eng
Online access: Full text
container_end_page 790
container_issue 6
container_start_page 761
container_title Journal of computer security
container_volume 16
creator Belsis, Petros
Fragos, Kostas
Gritzalis, Stefanos
Skourlas, Christos
description E-mail abuse has been steadily increasing during the last decade. E-mail users find themselves targeted by massive quantities of unsolicited bulk e-mail, which often contains offensive language or has fraudulent intentions. Internet Service Providers (ISPs), on the other hand, face considerable system overload as the incoming mail consumes network and storage resources. Among the plethora of solutions, the most prominent in terms of cost efficiency and complexity are the text filtering approaches. Most of these approaches model the problem using linear statistical models. Despite the popularity of such models, which stems from both their simplicity and their relative ease of interpretation, the linearity assumption they place on the data samples is inappropriate in practice, because linear models cannot capture the apparent non-linear relationships that characterize these samples. In this paper, we propose a margin-based feature selection approach integrated with a Hierarchical Mixtures of Experts (HME) system, which attempts to overcome limitations common to other machine-learning based approaches. After reducing the data dimensionality with effective feature selection algorithms, we evaluated our system on publicly available corpora of e-mails characterized by very high similarity between legitimate and bulk e-mail (and thus low discriminative potential). We experimented with two different architectures, a hierarchical HME and a perceptron HME. As a result, we confirm that our Spam Filtering (SF) - HME method outperforms other machine learning approaches, which achieve lower recall, as well as traditional rule-based approaches, which fall considerably short in precision.
doi_str_mv 10.3233/JCS-2008-0319
format Article
fulltext fulltext
identifier ISSN: 0926-227X
ispartof Journal of computer security, 2008-01, Vol.16 (6), p.761-790
issn 0926-227X
1875-8924
language eng
recordid cdi_proquest_miscellaneous_35560324
source EBSCOhost Business Source Complete
title Applying effective feature selection techniques with hierarchical mixtures of experts for spam classification
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T00%3A22%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Applying%20effective%20feature%20selection%20techniques%20with%20hierarchical%20mixtures%20of%20experts%20for%20spam%20classification&rft.jtitle=Journal%20of%20computer%20security&rft.au=Belsis,%20Petros&rft.date=2008-01-01&rft.volume=16&rft.issue=6&rft.spage=761&rft.epage=790&rft.pages=761-790&rft.issn=0926-227X&rft.eissn=1875-8924&rft_id=info:doi/10.3233/JCS-2008-0319&rft_dat=%3Cproquest_cross%3E35560324%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=35560324&rft_id=info:pmid/&rfr_iscdi=true