Exploration into Gray Area: Toward Efficient Labeling for Detecting Malicious Domain Names

In this paper, we propose a method to reduce the labeling cost while acquiring training data for a malicious domain name detection system using supervised machine learning. In the conventional systems, to train a classifier with high classification accuracy, large quantities of benign and malicious...

Bibliographic details
Published in:	IEICE Transactions on Communications 2020/04/01, Vol.E103.B(4), pp.375-388
Main authors:	FUKUSHI, Naoki; CHIBA, Daiki; AKIYAMA, Mitsuaki; UCHIDA, Masato
Format:	Article
Language:	eng
Subjects:	Accuracy; active learning; Classification; Classifiers; Comparative analysis; Data acquisition; data labeling; Domain names; ensemble learning; Labeling; Labels; Machine learning; malicious domain name; Training; URLs
Online access:	Full text

container_end_page	388
container_issue	4
container_start_page	375
container_title	IEICE Transactions on Communications
container_volume	E103.B
creator	FUKUSHI, Naoki; CHIBA, Daiki; AKIYAMA, Mitsuaki; UCHIDA, Masato
description	In this paper, we propose a method to reduce the labeling cost while acquiring training data for a malicious domain name detection system using supervised machine learning. In the conventional systems, to train a classifier with high classification accuracy, large quantities of benign and malicious domain names need to be prepared as training data. In general, malicious domain names are observed less frequently than benign domain names. Therefore, it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by using approximately 1% of the training data used by the conventional systems. Another disadvantage of the conventional system is that if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved. The combination of the two methods proposed here allows us to develop a new system for malicious domain name detection with high classification accuracy and generalization ability by labeling a small amount of training data.
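The two ideas named in the description, labeling only the data near the decision boundary (the gray area) and integrating multiple classifiers, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record does not say how the authors combine classifiers or measure closeness to the boundary, so simple score averaging and distance from a 0.5 threshold are used here, and the domain names, scores, and function names are hypothetical.

```python
from statistics import mean


def ensemble_score(scores_per_classifier):
    """Average, per domain, the malicious-probability scores of several classifiers.

    Averaging is an illustrative choice, not the integration rule of the paper.
    """
    return [mean(column) for column in zip(*scores_per_classifier)]


def select_gray_area(domains, scores, budget, threshold=0.5):
    """Return the `budget` domains whose score lies closest to the threshold."""
    ranked = sorted(zip(domains, scores), key=lambda pair: abs(pair[1] - threshold))
    return [domain for domain, _ in ranked[:budget]]


# Hypothetical unlabeled pool; one row of scores per classifier (made-up values).
pool = ["a.example", "b.example", "c.example", "d.example"]
scores_per_classifier = [
    [0.10, 0.45, 0.90, 0.60],
    [0.20, 0.55, 0.80, 0.70],
    [0.00, 0.50, 1.00, 0.80],
]

combined = ensemble_score(scores_per_classifier)
to_label = select_gray_area(pool, combined, budget=2)
print(to_label)
```

In an active-learning loop of this kind, the selected domains would be sent to a human labeler, added to the training set, and the classifiers retrained before the next selection round.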
doi_str_mv	10.1587/transcom.2019NRP0005
format	Article
publisher	Tokyo: The Institute of Electronics, Information and Communication Engineers
rights	2020 The Institute of Electronics, Information and Communication Engineers; Copyright Japan Science and Technology Agency 2020
eissn	1745-1345
addtitle	IEICE Trans. Commun.
date	2020-04-01
tpages	14
toplevel	peer_reviewed; online_resources
collection	CrossRef; Electronics & Communications Abstracts; Technology Research Database; Advanced Technologies Database with Aerospace
sourceid	proquest_cross
pqid	2385440694
fulltext	fulltext
identifier	ISSN: 0916-8516
ispartof	IEICE Transactions on Communications, 2020/04/01, Vol.E103.B(4), pp.375-388
issn	0916-8516 (ISSN); 1745-1345 (EISSN)
language	eng
recordid	cdi_proquest_journals_2385440694
source	Alma/SFX Local Collection
subjects	Accuracy; active learning; Classification; Classifiers; Comparative analysis; Data acquisition; data labeling; Domain names; ensemble learning; Labeling; Labels; Machine learning; malicious domain name; Training; URLs
title	Exploration into Gray Area: Toward Efficient Labeling for Detecting Malicious Domain Names
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-15T12%3A07%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Exploration%20into%20Gray%20Area:%20Toward%20Efficient%20Labeling%20for%20Detecting%20Malicious%20Domain%20Names&rft.jtitle=IEICE%20Transactions%20on%20Communications&rft.au=FUKUSHI,%20Naoki&rft.date=2020-04-01&rft.volume=E103.B&rft.issue=4&rft.spage=375&rft.epage=388&rft.pages=375-388&rft.issn=0916-8516&rft.eissn=1745-1345&rft_id=info:doi/10.1587/transcom.2019NRP0005&rft_dat=%3Cproquest_cross%3E2385440694%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2385440694&rft_id=info:pmid/&rfr_iscdi=true