Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method

Trusting the accuracy of data entered on online platforms can be difficult because malicious websites may gather information for unlawful purposes. The presence of such sites makes analyzing each website individually challenging, and it is hard to efficiently list a...

Bibliographic details
Published in:	arXiv.org 2024-06
Main authors:	Maftoun, Mohammad; Shadkam, Nima; Seyedeh Somayeh Salehi Komamardakhi; Mansor, Zulkefli; Joloudari, Javad Hassannataj
Format:	Article
Language:	eng
Subjects:	Algorithms; Classifiers; Data collection; Datasets; Decision trees; Machine learning; Multilayer perceptrons; Multilayers; Search methods; Support vector machines; Websites
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Maftoun, Mohammad; Shadkam, Nima; Seyedeh Somayeh Salehi Komamardakhi; Mansor, Zulkefli; Joloudari, Javad Hassannataj
description	Trusting the accuracy of data entered on online platforms can be difficult because malicious websites may gather information for unlawful purposes. The presence of such sites makes analyzing each website individually challenging, and it is hard to efficiently list all Uniform Resource Locators (URLs) on a blacklist. This ongoing challenge underscores the need for strong security measures to safeguard against potential threats and unauthorized data collection. To detect the risk posed by malicious websites, we propose Machine Learning (ML)-based techniques. To this end, we used several ML techniques such as Hist Gradient Boosting Classifier (HGBC), K-Nearest Neighbor (KNN), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM) to classify the records of a benign and malicious website dataset. The dataset contains 1781 records of malicious and benign website data with 13 features. First, we investigated missing value imputation on the dataset. Then, we normalized the data by scaling it to the range from zero to one. Next, because the dataset was imbalanced, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance the training data. After that, we applied the ML algorithms to the balanced training set. All algorithms were optimized using grid search. Finally, the models were evaluated on accuracy, precision, recall, F1 score, and Area Under the Curve (AUC). The results showed that the HGBC classifier performed best on these metrics compared with the other classifiers.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3069344941</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3069344941</sourcerecordid><originalsourceid>FETCH-proquest_journals_30693449413</originalsourceid><addsrcrecordid>eNqNy7sOgkAQheGNiYlGeYdJrElgF1Bb74U2RmuywgBjgMWdpfHpxcQHsDrF_52RmEqlQn8VSTkRHvMzCAKZLGUcq6nAi64pI9Mz3K9n2KHDzJFpoWdqSzCdo4bemMOJ2MHR6pywdbAxht0XbGvNTAWhhYfmwQ3X0lIOjNpmFTToKpPPxbjQNaP325lYHPa37cnvrHn1yC59mt62Q0pVkKxVFK2jUP2nPlYURpw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3069344941</pqid></control><display><type>article</type><title>Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method</title><source>Freely Accessible Journals at publisher websites</source><creator>Maftoun, Mohammad ; Shadkam, Nima ; Seyedeh Somayeh Salehi Komamardakhi ; Mansor, Zulkefli ; Joloudari, Javad Hassannataj</creator><creatorcontrib>Maftoun, Mohammad ; Shadkam, Nima ; Seyedeh Somayeh Salehi Komamardakhi ; Mansor, Zulkefli ; Joloudari, Javad Hassannataj</creatorcontrib><description>Trusting the accuracy of data inputted on online platforms can be difficult due to the possibility of malicious websites gathering information for unlawful reasons. Analyzing each website individually becomes challenging with the presence of such malicious sites, making it hard to efficiently list all Uniform Resource Locators (URLs) on a blacklist. This ongoing challenge emphasizes the crucial need for strong security measures to safeguard against potential threats and unauthorized data collection. To detect the risk posed by malicious websites, it is proposed to utilize Machine Learning (ML)-based techniques. 
To this, we used several ML techniques such as Hist Gradient Boosting Classifier (HGBC), K-Nearest Neighbor (KNN), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM) for detection of the benign and malicious website dataset. The dataset used contains 1781 records of malicious and benign website data with 13 features. First, we investigated missing value imputation on the dataset. Then, we normalized this data by scaling to a range of zero and one. Next, we utilized the Synthetic Minority Oversampling Technique (SMOTE) to balance the training data since the data set was unbalanced. After that, we applied ML algorithms to the balanced training set. Meanwhile, all algorithms were optimized based on grid search. Finally, the models were evaluated based on accuracy, precision, recall, F1 score, and the Area Under the Curve (AUC) metrics. The results demonstrated that the HGBC classifier has the best performance in terms of the mentioned metrics compared to the other classifiers.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Algorithms ; Classifiers ; Data collection ; Datasets ; Decision trees ; Machine learning ; Multilayer perceptrons ; Multilayers ; Search methods ; Support vector machines ; Websites</subject><ispartof>arXiv.org, 2024-06</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by-nc-sa/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>780,784</link.rule.ids></links><search><creatorcontrib>Maftoun, Mohammad</creatorcontrib><creatorcontrib>Shadkam, Nima</creatorcontrib><creatorcontrib>Seyedeh Somayeh Salehi Komamardakhi</creatorcontrib><creatorcontrib>Mansor, Zulkefli</creatorcontrib><creatorcontrib>Joloudari, Javad Hassannataj</creatorcontrib><title>Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method</title><title>arXiv.org</title><description>Trusting the accuracy of data inputted on online platforms can be difficult due to the possibility of malicious websites gathering information for unlawful reasons. Analyzing each website individually becomes challenging with the presence of such malicious sites, making it hard to efficiently list all Uniform Resource Locators (URLs) on a blacklist. This ongoing challenge emphasizes the crucial need for strong security measures to safeguard against potential threats and unauthorized data collection. To detect the risk posed by malicious websites, it is proposed to utilize Machine Learning (ML)-based techniques. To this, we used several ML techniques such as Hist Gradient Boosting Classifier (HGBC), K-Nearest Neighbor (KNN), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM) for detection of the benign and malicious website dataset. The dataset used contains 1781 records of malicious and benign website data with 13 features. First, we investigated missing value imputation on the dataset. 
Then, we normalized this data by scaling to a range of zero and one. Next, we utilized the Synthetic Minority Oversampling Technique (SMOTE) to balance the training data since the data set was unbalanced. After that, we applied ML algorithms to the balanced training set. Meanwhile, all algorithms were optimized based on grid search. Finally, the models were evaluated based on accuracy, precision, recall, F1 score, and the Area Under the Curve (AUC) metrics. The results demonstrated that the HGBC classifier has the best performance in terms of the mentioned metrics compared to the other classifiers.</description><subject>Algorithms</subject><subject>Classifiers</subject><subject>Data collection</subject><subject>Datasets</subject><subject>Decision trees</subject><subject>Machine learning</subject><subject>Multilayer perceptrons</subject><subject>Multilayers</subject><subject>Search methods</subject><subject>Support vector machines</subject><subject>Websites</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNy7sOgkAQheGNiYlGeYdJrElgF1Bb74U2RmuywgBjgMWdpfHpxcQHsDrF_52RmEqlQn8VSTkRHvMzCAKZLGUcq6nAi64pI9Mz3K9n2KHDzJFpoWdqSzCdo4bemMOJ2MHR6pywdbAxht0XbGvNTAWhhYfmwQ3X0lIOjNpmFTToKpPPxbjQNaP325lYHPa37cnvrHn1yC59mt62Q0pVkKxVFK2jUP2nPlYURpw</recordid><startdate>20240612</startdate><enddate>20240612</enddate><creator>Maftoun, Mohammad</creator><creator>Shadkam, Nima</creator><creator>Seyedeh Somayeh Salehi Komamardakhi</creator><creator>Mansor, Zulkefli</creator><creator>Joloudari, Javad Hassannataj</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240612</creationdate><title>Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method</title><author>Maftoun, Mohammad ; Shadkam, Nima ; Seyedeh Somayeh Salehi Komamardakhi ; Mansor, Zulkefli ; Joloudari, Javad Hassannataj</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30693449413</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Algorithms</topic><topic>Classifiers</topic><topic>Data collection</topic><topic>Datasets</topic><topic>Decision trees</topic><topic>Machine learning</topic><topic>Multilayer perceptrons</topic><topic>Multilayers</topic><topic>Search methods</topic><topic>Support vector machines</topic><topic>Websites</topic><toplevel>online_resources</toplevel><creatorcontrib>Maftoun, Mohammad</creatorcontrib><creatorcontrib>Shadkam, Nima</creatorcontrib><creatorcontrib>Seyedeh Somayeh Salehi Komamardakhi</creatorcontrib><creatorcontrib>Mansor, Zulkefli</creatorcontrib><creatorcontrib>Joloudari, Javad Hassannataj</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community 
College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>ProQuest Publicly Available Content database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Maftoun, Mohammad</au><au>Shadkam, Nima</au><au>Seyedeh Somayeh Salehi Komamardakhi</au><au>Mansor, Zulkefli</au><au>Joloudari, Javad Hassannataj</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method</atitle><jtitle>arXiv.org</jtitle><date>2024-06-12</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Trusting the accuracy of data inputted on online platforms can be difficult due to the possibility of malicious websites gathering information for unlawful reasons. Analyzing each website individually becomes challenging with the presence of such malicious sites, making it hard to efficiently list all Uniform Resource Locators (URLs) on a blacklist. This ongoing challenge emphasizes the crucial need for strong security measures to safeguard against potential threats and unauthorized data collection. To detect the risk posed by malicious websites, it is proposed to utilize Machine Learning (ML)-based techniques. 
To this, we used several ML techniques such as Hist Gradient Boosting Classifier (HGBC), K-Nearest Neighbor (KNN), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM) for detection of the benign and malicious website dataset. The dataset used contains 1781 records of malicious and benign website data with 13 features. First, we investigated missing value imputation on the dataset. Then, we normalized this data by scaling to a range of zero and one. Next, we utilized the Synthetic Minority Oversampling Technique (SMOTE) to balance the training data since the data set was unbalanced. After that, we applied ML algorithms to the balanced training set. Meanwhile, all algorithms were optimized based on grid search. Finally, the models were evaluated based on accuracy, precision, recall, F1 score, and the Area Under the Curve (AUC) metrics. The results demonstrated that the HGBC classifier has the best performance in terms of the mentioned metrics compared to the other classifiers.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-06
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3069344941
source	Freely Accessible Journals at publisher websites
subjects	Algorithms; Classifiers; Data collection; Datasets; Decision trees; Machine learning; Multilayer perceptrons; Multilayers; Search methods; Support vector machines; Websites
title	Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T05%3A56%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Malicious%20URL%20Detection%20using%20optimized%20Hist%20Gradient%20Boosting%20Classifier%20based%20on%20grid%20search%20method&rft.jtitle=arXiv.org&rft.au=Maftoun,%20Mohammad&rft.date=2024-06-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3069344941%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3069344941&rft_id=info:pmid/&rfr_iscdi=true
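
The description field above names three steps that are easy to illustrate in isolation: scaling features to the range from zero to one, SMOTE-style oversampling of the minority class, and precision/recall/F1 evaluation. The following is a minimal standard-library sketch of those ideas, not the authors' code. The function names (`min_max_scale`, `smote_like`, `precision_recall_f1`), the toy data, and the simplified neighbour selection are all assumptions made for illustration; the classifiers, grid search, imputation, accuracy, and AUC from the record are not covered here.

```python
import math
import random


def min_max_scale(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def smote_like(minority, n_new, k=2, seed=0):
    """Simplified SMOTE (illustrative): each synthetic point lies on the
    segment between a minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): two features, label 1 is the minority class.
X = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0], [2.0, 14.0], [8.0, 26.0]]
y = [0, 0, 0, 1, 1]

X_scaled = min_max_scale(X)
minority = [x for x, label in zip(X_scaled, y) if label == 1]
extra = smote_like(minority, n_new=1, k=1)  # one synthetic minority row
```

As the description states, oversampling is applied to the training data only; in a real experiment the synthetic rows would be added after the train/test split so that no synthetic point leaks into evaluation.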