A Novel Resampling Technique for Imbalanced Dataset Optimization
Despite the enormous amount of data, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detec...
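Published in: | arXiv.org 2020-12 |
---|---|
Main authors: | Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della Penna |
Format: | Article |
Language: | eng |
Subjects: | Algorithms ; Datasets ; Machine learning ; Malware ; Modules ; Optimization ; Outliers (statistics) ; Oversampling ; Resampling |
Online access: | Full text |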
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della Penna |
description | Despite the enormous amount of data available, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have addressed malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating its representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it also means that the RRw-Optimized MTA-KDD'19 is potentially subject to the class imbalance problem. We capture the difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. In this work, we developed two versions of the Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithm for dealing with the class imbalance problem. The first module of the G1Nos algorithms performs instance selection based on the silhouette coefficient, identifying the critical threshold of the Imbalance Degree (ID); the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The G1Nos algorithms balance the classes by re-establishing the proportions between the two classes of the dataset used. The experimental results show that our oversampling algorithms work better than the other two SOTA methodologies on all the metrics considered. |
format | Article |
fullrecord | <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2474508049</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2474508049</sourcerecordid><originalsourceid>FETCH-proquest_journals_24745080493</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRwcFTwyy9LzVEISi1OzC3IycxLVwhJTc7IyywsTVVIyy9S8MxNSsxJzEtOTVFwSSxJLE4tUfAvKMnMzaxKLMnMz-NhYE1LzClO5YXS3AzKbq4hzh66BUX5QCOKS-Kz8kuL8oBS8UYm5iamBhYGJpbGxKkCABQ4OOw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2474508049</pqid></control><display><type>article</type><title>A Novel Resampling Technique for Imbalanced Dataset Optimization</title><source>Free E- Journals</source><creator>Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della Penna</creator><creatorcontrib>Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della Penna</creatorcontrib><description>Despite the enormous amount of data, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating the representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it translates into a potential due to the class imbalance problem that the RRw-Optimized MTA-KDD'19 will occur. We capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. 
In this work, we developed two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with class imbalance problem. The first module of G1Nos algorithms performs a coefficient-based instance selection silhouette identifying the critical threshold of Imbalance Degree. (ID), the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The balancing of the classes is done by our G1Nos algorithms to re-establish the proportions between the two classes of the used dataset. The experimental results show that our oversampling algorithm work better than the other two SOTA methodologies in all the metrics considered.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Datasets ; Machine learning ; Malware ; Modules ; Optimization ; Outliers (statistics) ; Oversampling ; Resampling</subject><ispartof>arXiv.org, 2020-12</ispartof><rights>2020. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Letteri, Ivan</creatorcontrib><creatorcontrib>Antonio Di Cecco</creatorcontrib><creatorcontrib>Dyoub, Abeer</creatorcontrib><creatorcontrib>Giuseppe Della Penna</creatorcontrib><title>A Novel Resampling Technique for Imbalanced Dataset Optimization</title><title>arXiv.org</title><description>Despite the enormous amount of data, particular events of interest can still be quite rare. 
Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating the representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it translates into a potential due to the class imbalance problem that the RRw-Optimized MTA-KDD'19 will occur. We capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. In this work, we developed two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with class imbalance problem. The first module of G1Nos algorithms performs a coefficient-based instance selection silhouette identifying the critical threshold of Imbalance Degree. (ID), the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The balancing of the classes is done by our G1Nos algorithms to re-establish the proportions between the two classes of the used dataset. 
The experimental results show that our oversampling algorithm work better than the other two SOTA methodologies in all the metrics considered.</description><subject>Algorithms</subject><subject>Datasets</subject><subject>Machine learning</subject><subject>Malware</subject><subject>Modules</subject><subject>Optimization</subject><subject>Outliers (statistics)</subject><subject>Oversampling</subject><subject>Resampling</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRwcFTwyy9LzVEISi1OzC3IycxLVwhJTc7IyywsTVVIyy9S8MxNSsxJzEtOTVFwSSxJLE4tUfAvKMnMzaxKLMnMz-NhYE1LzClO5YXS3AzKbq4hzh66BUX5QCOKS-Kz8kuL8oBS8UYm5iamBhYGJpbGxKkCABQ4OOw</recordid><startdate>20201230</startdate><enddate>20201230</enddate><creator>Letteri, Ivan</creator><creator>Antonio Di Cecco</creator><creator>Dyoub, Abeer</creator><creator>Giuseppe Della Penna</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20201230</creationdate><title>A Novel Resampling Technique for Imbalanced Dataset Optimization</title><author>Letteri, Ivan ; Antonio Di Cecco ; Dyoub, Abeer ; Giuseppe Della 
Penna</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_24745080493</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Datasets</topic><topic>Machine learning</topic><topic>Malware</topic><topic>Modules</topic><topic>Optimization</topic><topic>Outliers (statistics)</topic><topic>Oversampling</topic><topic>Resampling</topic><toplevel>online_resources</toplevel><creatorcontrib>Letteri, Ivan</creatorcontrib><creatorcontrib>Antonio Di Cecco</creatorcontrib><creatorcontrib>Dyoub, Abeer</creatorcontrib><creatorcontrib>Giuseppe Della Penna</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Letteri, Ivan</au><au>Antonio Di Cecco</au><au>Dyoub, Abeer</au><au>Giuseppe Della 
Penna</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>A Novel Resampling Technique for Imbalanced Dataset Optimization</atitle><jtitle>arXiv.org</jtitle><date>2020-12-30</date><risdate>2020</risdate><eissn>2331-8422</eissn><abstract>Despite the enormous amount of data, particular events of interest can still be quite rare. Classification of rare events is a common problem in many domains, such as fraudulent transactions, malware traffic analysis and network intrusion detection. Many studies have been developed for malware detection using machine learning approaches on various datasets, but as far as we know only the MTA-KDD'19 dataset has the peculiarity of updating the representative set of malicious traffic on a daily basis. This daily updating is the added value of the dataset, but it translates into a potential due to the class imbalance problem that the RRw-Optimized MTA-KDD'19 will occur. We capture difficulties of class distribution in real datasets by considering four types of minority class examples: safe, borderline, rare and outliers. In this work, we developed two versions of Generative Silhouette Resampling 1-Nearest Neighbour (G1Nos) oversampling algorithms for dealing with class imbalance problem. The first module of G1Nos algorithms performs a coefficient-based instance selection silhouette identifying the critical threshold of Imbalance Degree. (ID), the second module generates synthetic samples using a SMOTE-like oversampling algorithm. The balancing of the classes is done by our G1Nos algorithms to re-establish the proportions between the two classes of the used dataset. The experimental results show that our oversampling algorithm work better than the other two SOTA methodologies in all the metrics considered.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2020-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2474508049 |
source | Free E- Journals |
subjects | Algorithms ; Datasets ; Machine learning ; Malware ; Modules ; Optimization ; Outliers (statistics) ; Oversampling ; Resampling |
title | A Novel Resampling Technique for Imbalanced Dataset Optimization |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T20%3A29%3A51IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=A%20Novel%20Resampling%20Technique%20for%20Imbalanced%20Dataset%20Optimization&rft.jtitle=arXiv.org&rft.au=Letteri,%20Ivan&rft.date=2020-12-30&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2474508049%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2474508049&rft_id=info:pmid/&rfr_iscdi=true |
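
The description above outlines a two-module pipeline: silhouette-based instance selection followed by SMOTE-like oversampling. The following is only a minimal sketch of that general idea, not the authors' G1Nos implementation. The function names `silhouette` and `oversample`, the fixed `threshold` parameter (standing in for the Imbalance Degree criterion, which the record does not specify), and the fallback to all minority points are assumptions made for illustration.

```python
import math
import random


def silhouette(i, own, other):
    """Silhouette coefficient of own[i] in a two-class setting."""
    rest = own[:i] + own[i + 1:]
    if not rest:
        return 0.0
    a = sum(math.dist(own[i], p) for p in rest) / len(rest)    # cohesion
    b = sum(math.dist(own[i], p) for p in other) / len(other)  # separation
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0


def oversample(minority, majority, threshold=0.0, seed=0):
    """Return synthetic minority points that balance the two classes.

    `threshold` is a hypothetical stand-in for the selection criterion.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    # Module 1 (assumed form): keep minority points whose silhouette
    # reaches the threshold; fall back to all points if too few remain.
    seeds = [p for i, p in enumerate(minority)
             if silhouette(i, minority, majority) >= threshold]
    if len(seeds) < 2:
        seeds = list(minority)
    # Module 2: SMOTE-like interpolation towards the 1-nearest neighbour.
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        j = rng.randrange(len(seeds))
        base = seeds[j]
        k = min((k for k in range(len(seeds)) if k != j),
                key=lambda k: math.dist(base, seeds[k]))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, seeds[k])))
    return synthetic
```

With points given as tuples of floats, `oversample(minority, majority)` returns just enough synthetic points for the minority class to match the majority class in size. Filtering by silhouette before interpolating keeps points that sit inside the other class from seeding new samples, which is one plausible reading of why selection precedes generation.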