Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic

Machine learning (ML) is promising in accurately detecting malicious flows in encrypted network traffic; however, it is challenging to collect a training dataset that contains a sufficient amount of encrypted malicious data with correct labels. When ML models are trained with low-quality training da...

Bibliographic details
Published in:	arXiv.org 2023-09
Main authors:	Yuqi Qing; Yin, Qilei; Deng, Xinhao; Chen, Yihao; Liu, Zhuotao; Sun, Kun; Xu, Ke; Zhang, Jia; Li, Qi
Format:	Article
Language:	eng
Subjects:	Communications traffic; Computer Science - Cryptography and Security; Datasets; Labels; Machine learning; Performance degradation; Training
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Yuqi Qing; Yin, Qilei; Deng, Xinhao; Chen, Yihao; Liu, Zhuotao; Sun, Kun; Xu, Ke; Zhang, Jia; Li, Qi
description	Machine learning (ML) is promising for accurately detecting malicious flows in encrypted network traffic; however, it is challenging to collect a training dataset that contains a sufficient amount of encrypted malicious data with correct labels. When ML models are trained on low-quality training data, their performance degrades. In this paper, we address a real-world low-quality training dataset problem, namely, detecting encrypted malicious traffic generated by continuously evolving malware. We develop RAPIER, which exploits the different distributions of normal and malicious traffic data in the feature space, where normal data is tightly distributed in a certain area and malicious data is scattered over the entire feature space, to augment the training data for model training. RAPIER includes two pre-processing modules that convert traffic into feature vectors and correct label noise. We evaluate our system on two public datasets and one combined dataset. With 1000 samples and 45% noise from each dataset, our system achieves F1 scores of 0.770, 0.776, and 0.855, respectively, with average improvements of 352.6%, 284.3%, and 214.9% over existing methods. Furthermore, we evaluate RAPIER on a real-world dataset obtained from a security enterprise. RAPIER effectively detects encrypted malicious traffic with a best F1 score of 0.773 and improves the F1 score of existing methods by an average of 272.5%.
doi_str_mv	10.48550/arxiv.2309.04798
format	Article
fullrecord	<record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2309_04798</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2864014393</sourcerecordid><originalsourceid>FETCH-LOGICAL-a523-ca5e9c43736f10081f611d292e56f70ba4395c345b8445bfc7c30bbf67ba832b3</originalsourceid><addsrcrecordid>eNotkE1PAjEQQBsTEwnyAzzZxPNi22l3uydDAD8SlGi4b9rSmiJssdsV99-7gJeZy8ubyUPohpIxl0KQexV__c-YASnHhBelvEADBkAzyRm7QqOm2RBCWF4wIWCA1otwyN5btfWpw6uofO3rTzxTSeFlve0e8AR_BN02CT9GtbOHEL-wCxHPbLImHdl5bWK3T3aNX3uL8aFt8JtNJ7IXOufNNbp0atvY0f8eotXjfDV9zhbLp5fpZJEpwSAzStjScCggd5QQSV1O6ZqVzIrcFUQrDqUwwIWWvB_OFAaI1i4vtJLANAzR7Vl7SlDto9-p2FXHFNUpRU_cnYl9DN-tbVK1CW2s-58qJnNOaH8C4A9fN2FN</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2864014393</pqid></control><display><type>article</type><title>Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Yuqi Qing ; Yin, Qilei ; Deng, Xinhao ; Chen, Yihao ; Liu, Zhuotao ; Sun, Kun ; Xu, Ke ; Zhang, Jia ; Li, Qi</creator><creatorcontrib>Yuqi Qing ; Yin, Qilei ; Deng, Xinhao ; Chen, Yihao ; Liu, Zhuotao ; Sun, Kun ; Xu, Ke ; Zhang, Jia ; Li, Qi</creatorcontrib><description>Machine learning (ML) is promising in accurately detecting malicious flows in encrypted network traffic; however, it is challenging to collect a training dataset that contains a sufficient amount of encrypted malicious data with correct labels. When ML models are trained with low-quality training data, they suffer degraded performance. In this paper, we aim at addressing a real-world low-quality training dataset problem, namely, detecting encrypted malicious traffic generated by continuously evolving malware. 
We develop RAPIER that fully utilizes different distributions of normal and malicious traffic data in the feature space, where normal data is tightly distributed in a certain area and the malicious data is scattered over the entire feature space to augment training data for model training. RAPIER includes two pre-processing modules to convert traffic into feature vectors and correct label noises. We evaluate our system on two public datasets and one combined dataset. With 1000 samples and 45% noises from each dataset, our system achieves the F1 scores of 0.770, 0.776, and 0.855, respectively, achieving average improvements of 352.6%, 284.3%, and 214.9% over the existing methods, respectively. Furthermore, We evaluate RAPIER with a real-world dataset obtained from a security enterprise. RAPIER effectively achieves encrypted malicious traffic detection with the best F1 score of 0.773 and improves the F1 score of existing methods by an average of 272.5%.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2309.04798</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Communications traffic ; Computer Science - Cryptography and Security ; Datasets ; Labels ; Machine learning ; Performance degradation ; Training</subject><ispartof>arXiv.org, 2023-09</ispartof><rights>2023. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,784,885,27925</link.rule.ids><backlink>$$Uhttps://doi.org/10.48550/arXiv.2309.04798$$DView paper in arXiv$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.14722/ndss.2024.23081$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink></links><search><creatorcontrib>Yuqi Qing</creatorcontrib><creatorcontrib>Yin, Qilei</creatorcontrib><creatorcontrib>Deng, Xinhao</creatorcontrib><creatorcontrib>Chen, Yihao</creatorcontrib><creatorcontrib>Liu, Zhuotao</creatorcontrib><creatorcontrib>Sun, Kun</creatorcontrib><creatorcontrib>Xu, Ke</creatorcontrib><creatorcontrib>Zhang, Jia</creatorcontrib><creatorcontrib>Li, Qi</creatorcontrib><title>Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic</title><title>arXiv.org</title><description>Machine learning (ML) is promising in accurately detecting malicious flows in encrypted network traffic; however, it is challenging to collect a training dataset that contains a sufficient amount of encrypted malicious data with correct labels. When ML models are trained with low-quality training data, they suffer degraded performance. In this paper, we aim at addressing a real-world low-quality training dataset problem, namely, detecting encrypted malicious traffic generated by continuously evolving malware. 
We develop RAPIER that fully utilizes different distributions of normal and malicious traffic data in the feature space, where normal data is tightly distributed in a certain area and the malicious data is scattered over the entire feature space to augment training data for model training. RAPIER includes two pre-processing modules to convert traffic into feature vectors and correct label noises. We evaluate our system on two public datasets and one combined dataset. With 1000 samples and 45% noises from each dataset, our system achieves the F1 scores of 0.770, 0.776, and 0.855, respectively, achieving average improvements of 352.6%, 284.3%, and 214.9% over the existing methods, respectively. Furthermore, We evaluate RAPIER with a real-world dataset obtained from a security enterprise. RAPIER effectively achieves encrypted malicious traffic detection with the best F1 score of 0.773 and improves the F1 score of existing methods by an average of 272.5%.</description><subject>Communications traffic</subject><subject>Computer Science - Cryptography and Security</subject><subject>Datasets</subject><subject>Labels</subject><subject>Machine learning</subject><subject>Performance 
degradation</subject><subject>Training</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotkE1PAjEQQBsTEwnyAzzZxPNi22l3uydDAD8SlGi4b9rSmiJssdsV99-7gJeZy8ubyUPohpIxl0KQexV__c-YASnHhBelvEADBkAzyRm7QqOm2RBCWF4wIWCA1otwyN5btfWpw6uofO3rTzxTSeFlve0e8AR_BN02CT9GtbOHEL-wCxHPbLImHdl5bWK3T3aNX3uL8aFt8JtNJ7IXOufNNbp0atvY0f8eotXjfDV9zhbLp5fpZJEpwSAzStjScCggd5QQSV1O6ZqVzIrcFUQrDqUwwIWWvB_OFAaI1i4vtJLANAzR7Vl7SlDto9-p2FXHFNUpRU_cnYl9DN-tbVK1CW2s-58qJnNOaH8C4A9fN2FN</recordid><startdate>20230909</startdate><enddate>20230909</enddate><creator>Yuqi Qing</creator><creator>Yin, Qilei</creator><creator>Deng, Xinhao</creator><creator>Chen, Yihao</creator><creator>Liu, Zhuotao</creator><creator>Sun, Kun</creator><creator>Xu, Ke</creator><creator>Zhang, Jia</creator><creator>Li, Qi</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20230909</creationdate><title>Low-Quality Training Data Only? 
A Robust Framework for Detecting Encrypted Malicious Network Traffic</title><author>Yuqi Qing ; Yin, Qilei ; Deng, Xinhao ; Chen, Yihao ; Liu, Zhuotao ; Sun, Kun ; Xu, Ke ; Zhang, Jia ; Li, Qi</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a523-ca5e9c43736f10081f611d292e56f70ba4395c345b8445bfc7c30bbf67ba832b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Communications traffic</topic><topic>Computer Science - Cryptography and Security</topic><topic>Datasets</topic><topic>Labels</topic><topic>Machine learning</topic><topic>Performance degradation</topic><topic>Training</topic><toplevel>online_resources</toplevel><creatorcontrib>Yuqi Qing</creatorcontrib><creatorcontrib>Yin, Qilei</creatorcontrib><creatorcontrib>Deng, Xinhao</creatorcontrib><creatorcontrib>Chen, Yihao</creatorcontrib><creatorcontrib>Liu, Zhuotao</creatorcontrib><creatorcontrib>Sun, Kun</creatorcontrib><creatorcontrib>Xu, Ke</creatorcontrib><creatorcontrib>Zhang, Jia</creatorcontrib><creatorcontrib>Li, Qi</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest 
One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Yuqi Qing</au><au>Yin, Qilei</au><au>Deng, Xinhao</au><au>Chen, Yihao</au><au>Liu, Zhuotao</au><au>Sun, Kun</au><au>Xu, Ke</au><au>Zhang, Jia</au><au>Li, Qi</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic</atitle><jtitle>arXiv.org</jtitle><date>2023-09-09</date><risdate>2023</risdate><eissn>2331-8422</eissn><abstract>Machine learning (ML) is promising in accurately detecting malicious flows in encrypted network traffic; however, it is challenging to collect a training dataset that contains a sufficient amount of encrypted malicious data with correct labels. When ML models are trained with low-quality training data, they suffer degraded performance. In this paper, we aim at addressing a real-world low-quality training dataset problem, namely, detecting encrypted malicious traffic generated by continuously evolving malware. We develop RAPIER that fully utilizes different distributions of normal and malicious traffic data in the feature space, where normal data is tightly distributed in a certain area and the malicious data is scattered over the entire feature space to augment training data for model training. RAPIER includes two pre-processing modules to convert traffic into feature vectors and correct label noises. We evaluate our system on two public datasets and one combined dataset. 
With 1000 samples and 45% noises from each dataset, our system achieves the F1 scores of 0.770, 0.776, and 0.855, respectively, achieving average improvements of 352.6%, 284.3%, and 214.9% over the existing methods, respectively. Furthermore, We evaluate RAPIER with a real-world dataset obtained from a security enterprise. RAPIER effectively achieves encrypted malicious traffic detection with the best F1 score of 0.773 and improves the F1 score of existing methods by an average of 272.5%.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2309.04798</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2023-09
issn	2331-8422
language	eng
recordid	cdi_arxiv_primary_2309_04798
source	arXiv.org; Free E- Journals
subjects	Communications traffic; Computer Science - Cryptography and Security; Datasets; Labels; Machine learning; Performance degradation; Training
title	Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T22%3A14%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Low-Quality%20Training%20Data%20Only?%20A%20Robust%20Framework%20for%20Detecting%20Encrypted%20Malicious%20Network%20Traffic&rft.jtitle=arXiv.org&rft.au=Yuqi%20Qing&rft.date=2023-09-09&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2309.04798&rft_dat=%3Cproquest_arxiv%3E2864014393%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2864014393&rft_id=info:pmid/&rfr_iscdi=true
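
The description field above reports F1 scores and average percentage improvements over existing methods. As a minimal sketch of how such figures are typically computed, the Python below derives an F1 score from confusion-matrix counts and averages the relative gain over several baselines. The function names, the counts, and the baseline scores are hypothetical illustrations, and averaging per-baseline relative gains is an assumption about the aggregation; none of this is taken from the RAPIER paper or its evaluation code.

```python
from typing import Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives.

    F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """Percentage gain of `new` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


def average_improvement(new: float, baselines: Sequence[float]) -> float:
    """Mean of the per-baseline relative gains (assumed aggregation)."""
    if not baselines:
        raise ValueError("at least one baseline is required")
    return sum(relative_improvement(new, b) for b in baselines) / len(baselines)


if __name__ == "__main__":
    # Hypothetical counts and baseline F1 scores, for illustration only.
    ours = f1_score(tp=80, fp=20, fn=20)
    gain = average_improvement(ours, [0.2, 0.4])
    print(f"F1 = {ours:.3f}, average improvement = {gain:.1f}%")
```

Under this definition an average improvement above 100% means the new score is, on average, more than double the baseline score, which is only possible when the baselines themselves score low.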