Feature Engineering and Machine Learning Model Comparison for Malicious Activity Detection in the DNS-Over-HTTPS Protocol

The Domain Name System (DNS) is among the most ubiquitous and important protocols for network communication; however, security concerns regarding DNS have been on the rise and demand for encrypted traffic has followed suit. Using a publicly available dataset, this work compares 10 different machine...

Detailed description

Bibliographic details
Published in: IEEE Access, 2021, Vol. 9, pp. 129902-129916
Main authors: Behnke, Matthew, Briner, Nathan, Cullen, Drake, Schwerdtfeger, Katelynn, Warren, Jackson, Basnet, Ram, Doleck, Tenzin
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 129916
container_issue
container_start_page 129902
container_title IEEE access
container_volume 9
creator Behnke, Matthew
Briner, Nathan
Cullen, Drake
Schwerdtfeger, Katelynn
Warren, Jackson
Basnet, Ram
Doleck, Tenzin
description The Domain Name System (DNS) is among the most ubiquitous and important protocols for network communication; however, security concerns regarding DNS have been on the rise and demand for encrypted traffic has followed suit. Using a publicly available dataset, this work compares 10 different machine learning classifiers using stratified 10-fold cross-validation. The classifiers are used to determine the most effective and efficient way of detecting malicious DNS over Hypertext Transfer Protocol Secure (HTTPS) traffic, dubbed DoH traffic. Model performance is evaluated on Non-DoH vs. DoH traffic, then tested on benign vs. malicious DoH traffic. Additionally, this paper seeks to build upon existing research by removing noise and introducing feature selection methods and feature explainability to produce a better model for real-world deployment. After eliminating five overfitting features, our findings indicate that light gradient boosting machine (LGBM) yielded the highest accuracy to training time ratio while approaching 0% error using 20 top features.
doi_str_mv 10.1109/ACCESS.2021.3113294
format Article
publisher Piscataway: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
eissn 2169-3536
coden IAECCG
tpages 15
toplevel peer_reviewed; online_resources
oa free_for_read
orcidid https://orcid.org/0000-0001-9086-5718; https://orcid.org/0000-0001-6864-6893; https://orcid.org/0000-0002-0000-8307; https://orcid.org/0000-0001-6529-5493
linktohtml https://ieeexplore.ieee.org/document/9540699
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2021, Vol.9, p.129902-129916
issn 2169-3536
2169-3536
language eng
recordid cdi_ieee_primary_9540699
source IEEE Open Access Journals; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals
subjects Browsers
Chi-square test
Chi-squared
Classifiers
decision tree
DNS
DoH
Domain names
Feature extraction
Hypertext
IP networks
LGBM
Machine learning
pearson correlation
Privacy
Protocols
random forest
Security
sequential forward selection
Servers
Traffic models
XGBM
title Feature Engineering and Machine Learning Model Comparison for Malicious Activity Detection in the DNS-Over-HTTPS Protocol
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-22T02%3A48%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_ieee_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Feature%20Engineering%20and%20Machine%20Learning%20Model%20Comparison%20for%20Malicious%20Activity%20Detection%20in%20the%20DNS-Over-HTTPS%20Protocol&rft.jtitle=IEEE%20access&rft.au=Behnke,%20Matthew&rft.date=2021&rft.volume=9&rft.spage=129902&rft.epage=129916&rft.pages=129902-129916&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2021.3113294&rft_dat=%3Cproquest_ieee_%3E2577554663%3C/proquest_ieee_%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2577554663&rft_id=info:pmid/&rft_ieee_id=9540699&rft_doaj_id=oai_doaj_org_article_e5a960db73b2434d91f1e08d65eaa227&rfr_iscdi=true
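
The description above mentions that the classifiers are compared using stratified 10-fold cross-validation. As a rough illustration of that evaluation scheme only, the following sketch splits an imbalanced, synthetic dataset into ten folds that each preserve the class ratio, then averages accuracy across the folds. It is not the authors' code: the data is randomly generated, the nearest-centroid model is a toy stand-in for the ten classifiers the paper compares (such as LGBM), and the names `stratified_k_fold`, `centroid_classifier`, and `cross_validate` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


def centroid_classifier(train_x, train_y):
    """Toy stand-in model: predict the class whose 1-D mean is nearest."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(train_x, train_y):
        sums[y] += x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def cross_validate(xs, ys, k=10):
    """Return the mean accuracy over k stratified folds."""
    scores = []
    for train_idx, test_idx in stratified_k_fold(ys, k):
        predict = centroid_classifier(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        hits = sum(predict(xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Synthetic, imbalanced data (an assumption for illustration):
# 900 "benign" and 100 "malicious" samples with one numeric feature each.
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(900)] + [rng.gauss(4.0, 1.0) for _ in range(100)]
ys = ["benign"] * 900 + ["malicious"] * 100
mean_accuracy = cross_validate(xs, ys, k=10)
print(f"mean accuracy over 10 folds: {mean_accuracy:.3f}")
```

Stratification matters here because the classes are imbalanced: a plain random split could leave a fold with very few malicious samples, which would make that fold's score unreliable. In practice a library routine such as scikit-learn's `StratifiedKFold` would likely replace the hand-written splitter.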