A Novel Resampling Technique for Imbalanced Dataset Optimization

Despite the enormous amount of data, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detec...

Bibliographic Details
Published in:	arXiv.org 2020-12
Main Authors:	Letteri, Ivan; Di Cecco, Antonio; Dyoub, Abeer; Della Penna, Giuseppe
Format:	Article
Language:	English
Subjects:	Algorithms; Datasets; Machine learning; Malware; Modules; Optimization; Outliers (statistics); Oversampling; Resampling
Online Access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della Penna
description	Despite the enormous amount of data, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating the representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it translates into a potential due to the class imbalance problem that the RRw-Optimized MTA-KDD'19 will occur. We capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. In this work, we developed two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with the class imbalance problem. The first module of the G1Nos algorithms performs a silhouette coefficient-based instance selection, identifying the critical threshold of Imbalance Degree (ID); the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The balancing of the classes is done by our G1Nos algorithms to re-establish the proportions between the two classes of the used dataset. The experimental results show that our oversampling algorithms work better than the other two SOTA methodologies in all the metrics considered.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2474508049</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2474508049</sourcerecordid><originalsourceid>FETCH-proquest_journals_24745080493</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRwcFTwyy9LzVEISi1OzC3IycxLVwhJTc7IyywsTVVIyy9S8MxNSsxJzEtOTVFwSSxJLE4tUfAvKMnMzaxKLMnMz-NhYE1LzClO5YXS3AzKbq4hzh66BUX5QCOKS-Kz8kuL8oBS8UYm5iamBhYGJpbGxKkCABQ4OOw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2474508049</pqid></control><display><type>article</type><title>A Novel Resampling Technique for Imbalanced Dataset Optimization</title><source>Free E- Journals</source><creator>Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della Penna</creator><creatorcontrib>Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della Penna</creatorcontrib><description>Despite the enormous amount of data, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating the representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it translates into a potential due to the class imbalance problem that the RRw-Optimized MTA-KDD'19 will occur. We capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. 
In this work, we developed two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with class imbalance problem. The first module of G1Nos algorithms performs a coefficient-based instance selection silhouette identifying the critical threshold of Imbalance Degree. (ID), the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The balancing of the classes is done by our G1Nos algorithms to re-establish the proportions between the two classes of the used dataset. The experimental results show that our oversampling algorithm work better than the other two SOTA methodologies in all the metrics considered.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Datasets ; Machine learning ; Malware ; Modules ; Optimization ; Outliers (statistics) ; Oversampling ; Resampling</subject><ispartof>arXiv.org, 2020-12</ispartof><rights>2020. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Letteri, Ivan</creatorcontrib><creatorcontrib>Antonio Di Cecco</creatorcontrib><creatorcontrib>Dyoub, Abeer</creatorcontrib><creatorcontrib>Giuseppe Della Penna</creatorcontrib><title>A Novel Resampling Technique for Imbalanced Dataset Optimization</title><title>arXiv.org</title><description>Despite the enormous amount of data, particular events of interest can still be quite rare. 
Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating the representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it translates into a potential due to the class imbalance problem that the RRw-Optimized MTA-KDD'19 will occur. We capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. In this work, we developed two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with class imbalance problem. The first module of G1Nos algorithms performs a coefficient-based instance selection silhouette identifying the critical threshold of Imbalance Degree. (ID), the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The balancing of the classes is done by our G1Nos algorithms to re-establish the proportions between the two classes of the used dataset. 
The experimental results show that our oversampling algorithm work better than the other two SOTA methodologies in all the metrics considered.</description><subject>Algorithms</subject><subject>Datasets</subject><subject>Machine learning</subject><subject>Malware</subject><subject>Modules</subject><subject>Optimization</subject><subject>Outliers (statistics)</subject><subject>Oversampling</subject><subject>Resampling</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRwcFTwyy9LzVEISi1OzC3IycxLVwhJTc7IyywsTVVIyy9S8MxNSsxJzEtOTVFwSSxJLE4tUfAvKMnMzaxKLMnMz-NhYE1LzClO5YXS3AzKbq4hzh66BUX5QCOKS-Kz8kuL8oBS8UYm5iamBhYGJpbGxKkCABQ4OOw</recordid><startdate>20201230</startdate><enddate>20201230</enddate><creator>Letteri, Ivan</creator><creator>Antonio Di Cecco</creator><creator>Dyoub, Abeer</creator><creator>Giuseppe Della Penna</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20201230</creationdate><title>A Novel Resampling Technique for Imbalanced Dataset Optimization</title><author>Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della 
Penna</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_24745080493</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Datasets</topic><topic>Machine learning</topic><topic>Malware</topic><topic>Modules</topic><topic>Optimization</topic><topic>Outliers (statistics)</topic><topic>Oversampling</topic><topic>Resampling</topic><toplevel>online_resources</toplevel><creatorcontrib>Letteri, Ivan</creatorcontrib><creatorcontrib>Antonio Di Cecco</creatorcontrib><creatorcontrib>Dyoub, Abeer</creatorcontrib><creatorcontrib>Giuseppe Della Penna</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Letteri, Ivan</au><au>Antonio Di Cecco</au><au>Dyoub, Abeer</au><au>Giuseppe Della 
Penna</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A Novel Resampling Technique for Imbalanced Dataset Optimization</atitle><jtitle>arXiv.org</jtitle><date>2020-12-30</date><risdate>2020</risdate><eissn>2331-8422</eissn><abstract>Despite the enormous amount of data, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating the representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it translates into a potential due to the class imbalance problem that the RRw-Optimized MTA-KDD'19 will occur. We capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. In this work, we developed two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with class imbalance problem. The first module of G1Nos algorithms performs a coefficient-based instance selection silhouette identifying the critical threshold of Imbalance Degree. (ID), the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The balancing of the classes is done by our G1Nos algorithms to re-establish the proportions between the two classes of the used dataset. The experimental results show that our oversampling algorithm work better than the other two SOTA methodologies in all the metrics considered.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2020-12
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2474508049
source	Free E-Journals
subjects	Algorithms ; Datasets ; Machine learning ; Malware ; Modules ; Optimization ; Outliers (statistics) ; Oversampling ; Resampling
title	A Novel Resampling Technique for Imbalanced Dataset Optimization
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T20%3A29%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20Novel%20Resampling%20Technique%20for%20Imbalanced%20Dataset%20Optimization&rft.jtitle=arXiv.org&rft.au=Letteri,%20Ivan&rft.date=2020-12-30&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2474508049%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2474508049&rft_id=info:pmid/&rfr_iscdi=true