Detection of malicious URLs based on word vector representation and ngram

Most Intrusion Detection Systems (IDS) nowadays are signature-based. They are very useful and accurate for detecting known attacks. However, they are inefficient with unknown attacks. Moreover, most cyber attacks start with malicious URLs. Attackers try to trick users into clicking on malicious U...

Bibliographic details
Published in:	Journal of intelligent & fuzzy systems 2018-01, Vol.35 (6), p.5889-5900
Main authors:	Hai, Quan Tran; Hwang, Seong Oun
Format:	Article
Language:	eng
Subjects:	Artificial intelligence; Blacklisting; Cybersecurity; Experimentation; Intrusion detection systems; Machine learning; Natural language processing; Representations; Websites
Online access:	Full text

container_end_page	5900
container_issue	6
container_start_page	5889
container_title	Journal of intelligent & fuzzy systems
container_volume	35
creator	Hai, Quan Tran ; Hwang, Seong Oun
description	Most Intrusion Detection Systems (IDS) nowadays are signature-based. They are very useful and accurate for detecting known attacks. However, they are inefficient with unknown attacks. Moreover, most cyber attacks start with malicious URLs. Attackers try to trick users into clicking on malicious URLs. This gives attackers an easy way to launch attacks. To defend against this, companies and organizations use IDS/IPS to detect malicious URLs using a blacklist of URLs. This is very efficient with known malicious URLs, but useless with unknown malicious URLs. To overcome this problem, a number of malicious Web site detection systems have been proposed. One of the most promising methods is to apply machine learning detection techniques. In this paper, we present a new lexical approach to classify URLs by using machine learning techniques which patternize the URLs. Our approach is based on natural language processing features which use word vector representation and ngram models on the blacklist word as the main features. Using this technique can help a classifier distinguish benign URLs from malicious ones. Our experimentation shows that our approach can achieve a high degree of accuracy at 97.1% in the case of SVM. Moreover, it can maintain a high level of robustness with 0.97 precision and 0.93 recall scores.
doi_str_mv	10.3233/JIFS-169831
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2159996350</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2159996350</sourcerecordid><originalsourceid>FETCH-LOGICAL-c261t-bf483eba609c58d8b4ece2fb3a1fd98573a61d28e84f45b8e515dacd54d916613</originalsourceid><addsrcrecordid>eNotkE1LAzEQhoMoWKsn_0DAo6zme5OjVKuVgqD2HLLJRLa0m5psFf-9W9fTDMwz7wsPQpeU3HDG-e3zYv5WUWU0p0doQnUtK21UfTzsRImKMqFO0Vkpa0JoLRmZoMU99OD7NnU4Rbx1m9a3aV_w6nVZcOMKBDycvlMO-GvgUsYZdhkKdL37-3JdwN1HdttzdBLdpsDF_5yi1fzhffZULV8eF7O7ZeWZon3VRKE5NE4R46UOuhHggcWGOxqD0bLmTtHANGgRhWw0SCqD80GKYKhSlE_R1Zi7y-lzD6W367TP3VBpGZXGGMUlGajrkfI5lZIh2l1uty7_WErswZU9uLKjK_4Lzg5cKA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2159996350</pqid></control><display><type>article</type><title>Detection of malicious URLs based on word vector representation and ngram</title><source>Business Source Complete</source><creator>Hai, Quan Tran ; Hwang, Seong Oun</creator><contributor>Hwang, Seong Oun</contributor><creatorcontrib>Hai, Quan Tran ; Hwang, Seong Oun ; Hwang, Seong Oun</creatorcontrib><description>Most Intrusion Detection Systems (IDS) nowadays are signature-based. They are very useful and accurate for detecting known attacks. However, they are inefficient with unknown attacks. Moreover, most of cyber attacks start with malicious URLs. Attackers try to trick users into clicking on malicious URLs. This gives attackers an easy way to launch attacks. To defend against this, companies and organizations use IDS/IPS to detect malicous URLs using blacklist of URLs. This is very efficient with known malicious URLs, but useless with unknown malicious URLs. To overcome this problem, a number of malicious Web site detection systems have been proposed. One of the most promising methods is to apply machine learning detection techniques. 
In this paper, we present a new lexical approach to classify URLs by using machine learning techniques which patternize the URLs. Our approach is based on natural language processing features which use word vector representation and ngram models on the blacklist word as the main features. Using this technique can help classifier distinguish benign URLs from malicious ones. Our experimentation shows that our approach can achieve a high degree of accuracy at 97.1% in the case of SVM. Moreover, it can maintain a high level of robustness with 0.97 precision and 0.93 recall scores.</description><identifier>ISSN: 1064-1246</identifier><identifier>EISSN: 1875-8967</identifier><identifier>DOI: 10.3233/JIFS-169831</identifier><language>eng</language><publisher>Amsterdam: IOS Press BV</publisher><subject>Artificial intelligence ; Blacklisting ; Cybersecurity ; Experimentation ; Intrusion detection systems ; Machine learning ; Natural language processing ; Representations ; Websites</subject><ispartof>Journal of intelligent & fuzzy systems, 2018-01, Vol.35 (6), p.5889-5900</ispartof><rights>Copyright IOS Press BV 2018</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c261t-bf483eba609c58d8b4ece2fb3a1fd98573a61d28e84f45b8e515dacd54d916613</citedby><cites>FETCH-LOGICAL-c261t-bf483eba609c58d8b4ece2fb3a1fd98573a61d28e84f45b8e515dacd54d916613</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27903,27904</link.rule.ids></links><search><contributor>Hwang, Seong Oun</contributor><creatorcontrib>Hai, Quan Tran</creatorcontrib><creatorcontrib>Hwang, Seong Oun</creatorcontrib><title>Detection of malicious URLs based on word vector representation and ngram</title><title>Journal of intelligent & fuzzy systems</title><description>Most Intrusion Detection Systems (IDS) nowadays 
are signature-based. They are very useful and accurate for detecting known attacks. However, they are inefficient with unknown attacks. Moreover, most of cyber attacks start with malicious URLs. Attackers try to trick users into clicking on malicious URLs. This gives attackers an easy way to launch attacks. To defend against this, companies and organizations use IDS/IPS to detect malicous URLs using blacklist of URLs. This is very efficient with known malicious URLs, but useless with unknown malicious URLs. To overcome this problem, a number of malicious Web site detection systems have been proposed. One of the most promising methods is to apply machine learning detection techniques. In this paper, we present a new lexical approach to classify URLs by using machine learning techniques which patternize the URLs. Our approach is based on natural language processing features which use word vector representation and ngram models on the blacklist word as the main features. Using this technique can help classifier distinguish benign URLs from malicious ones. Our experimentation shows that our approach can achieve a high degree of accuracy at 97.1% in the case of SVM. 
Moreover, it can maintain a high level of robustness with 0.97 precision and 0.93 recall scores.</description><subject>Artificial intelligence</subject><subject>Blacklisting</subject><subject>Cybersecurity</subject><subject>Experimentation</subject><subject>Intrusion detection systems</subject><subject>Machine learning</subject><subject>Natural language processing</subject><subject>Representations</subject><subject>Websites</subject><issn>1064-1246</issn><issn>1875-8967</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><recordid>eNotkE1LAzEQhoMoWKsn_0DAo6zme5OjVKuVgqD2HLLJRLa0m5psFf-9W9fTDMwz7wsPQpeU3HDG-e3zYv5WUWU0p0doQnUtK21UfTzsRImKMqFO0Vkpa0JoLRmZoMU99OD7NnU4Rbx1m9a3aV_w6nVZcOMKBDycvlMO-GvgUsYZdhkKdL37-3JdwN1HdttzdBLdpsDF_5yi1fzhffZULV8eF7O7ZeWZon3VRKE5NE4R46UOuhHggcWGOxqD0bLmTtHANGgRhWw0SCqD80GKYKhSlE_R1Zi7y-lzD6W367TP3VBpGZXGGMUlGajrkfI5lZIh2l1uty7_WErswZU9uLKjK_4Lzg5cKA</recordid><startdate>20180101</startdate><enddate>20180101</enddate><creator>Hai, Quan Tran</creator><creator>Hwang, Seong Oun</creator><general>IOS Press BV</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20180101</creationdate><title>Detection of malicious URLs based on word vector representation and ngram</title><author>Hai, Quan Tran ; Hwang, Seong Oun</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c261t-bf483eba609c58d8b4ece2fb3a1fd98573a61d28e84f45b8e515dacd54d916613</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Artificial intelligence</topic><topic>Blacklisting</topic><topic>Cybersecurity</topic><topic>Experimentation</topic><topic>Intrusion detection systems</topic><topic>Machine learning</topic><topic>Natural language 
processing</topic><topic>Representations</topic><topic>Websites</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Hai, Quan Tran</creatorcontrib><creatorcontrib>Hwang, Seong Oun</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Journal of intelligent & fuzzy systems</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Hai, Quan Tran</au><au>Hwang, Seong Oun</au><au>Hwang, Seong Oun</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Detection of malicious URLs based on word vector representation and ngram</atitle><jtitle>Journal of intelligent & fuzzy systems</jtitle><date>2018-01-01</date><risdate>2018</risdate><volume>35</volume><issue>6</issue><spage>5889</spage><epage>5900</epage><pages>5889-5900</pages><issn>1064-1246</issn><eissn>1875-8967</eissn><abstract>Most Intrusion Detection Systems (IDS) nowadays are signature-based. They are very useful and accurate for detecting known attacks. However, they are inefficient with unknown attacks. Moreover, most of cyber attacks start with malicious URLs. Attackers try to trick users into clicking on malicious URLs. This gives attackers an easy way to launch attacks. To defend against this, companies and organizations use IDS/IPS to detect malicous URLs using blacklist of URLs. This is very efficient with known malicious URLs, but useless with unknown malicious URLs. To overcome this problem, a number of malicious Web site detection systems have been proposed. 
One of the most promising methods is to apply machine learning detection techniques. In this paper, we present a new lexical approach to classify URLs by using machine learning techniques which patternize the URLs. Our approach is based on natural language processing features which use word vector representation and ngram models on the blacklist word as the main features. Using this technique can help classifier distinguish benign URLs from malicious ones. Our experimentation shows that our approach can achieve a high degree of accuracy at 97.1% in the case of SVM. Moreover, it can maintain a high level of robustness with 0.97 precision and 0.93 recall scores.</abstract><cop>Amsterdam</cop><pub>IOS Press BV</pub><doi>10.3233/JIFS-169831</doi><tpages>12</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 1064-1246
ispartof	Journal of intelligent & fuzzy systems, 2018-01, Vol.35 (6), p.5889-5900
issn	1064-1246 1875-8967
language	eng
recordid	cdi_proquest_journals_2159996350
source	Business Source Complete
subjects	Artificial intelligence ; Blacklisting ; Cybersecurity ; Experimentation ; Intrusion detection systems ; Machine learning ; Natural language processing ; Representations ; Websites
title	Detection of malicious URLs based on word vector representation and ngram
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T17%3A48%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Detection%20of%20malicious%20URLs%20based%20on%20word%20vector%20representation%20and%20ngram&rft.jtitle=Journal%20of%20intelligent%20&%20fuzzy%20systems&rft.au=Hai,%20Quan%20Tran&rft.date=2018-01-01&rft.volume=35&rft.issue=6&rft.spage=5889&rft.epage=5900&rft.pages=5889-5900&rft.issn=1064-1246&rft.eissn=1875-8967&rft_id=info:doi/10.3233/JIFS-169831&rft_dat=%3Cproquest_cross%3E2159996350%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2159996350&rft_id=info:pmid/&rfr_iscdi=true
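
The abstract describes lexical features built from ngram models over blacklist words found in a URL. The record gives no implementation details, so the following is only a minimal sketch of that general idea under stated assumptions: the blacklist words, the tokenization rule, the ngram length of 3, and all function names are hypothetical and are not taken from the paper. The word vector representation and the SVM classifier mentioned in the abstract are not covered here.

```python
import re
from collections import Counter

# Hypothetical blacklist words for illustration only; the record does not
# list the words used in the paper.
BLACKLIST_WORDS = {"login", "verify", "secure", "update"}


def tokenize(url: str) -> list:
    """Split a URL into lowercase alphanumeric tokens (assumed rule)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def char_ngrams(token: str, n: int = 3) -> list:
    """Return all contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def lexical_features(url: str, n: int = 3) -> Counter:
    """Count n-grams of tokens that contain a blacklist word.

    The special key "blacklist_hits" counts how many blacklist words
    were matched across all tokens of the URL.
    """
    features = Counter()
    for token in tokenize(url):
        hits = [w for w in BLACKLIST_WORDS if w in token]
        if hits:
            features["blacklist_hits"] += len(hits)
            features.update(char_ngrams(token, n))
    return features


if __name__ == "__main__":
    print(lexical_features("http://example.com/secure-login"))
```

In a full pipeline, counts like these would typically be turned into a fixed-length numeric vector before being passed to a classifier; how the paper does this is not stated in the record.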