METHOD FOR MACHINE LEARNING-BASED HARMFUL WEB SITE CLASSIFICATION

Bibliographic Details
Main Author:	SONG, NAM GOO
Format:	Patent
Language:	eng ; kor
Subjects:	CALCULATING ; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS ; COMPUTING ; COUNTING ; ELECTRIC COMMUNICATION TECHNIQUE ; ELECTRICITY ; PHYSICS ; TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION

Description
Summary:	An objective of the present invention is to quickly identify the accessible address of a harmful website that resumes operation under a changed domain despite continuous regulation, and to classify the harmful website with a machine learning model so that such activity can be responded to in a timely manner. A machine learning-based harmful website classification method performed by a main server comprises: (a) a step of accessing a specific website; (b) a step of extracting the HTML source code of the website and preprocessing it to perform tokenization; (c) a step of vectorizing each token in accordance with a preset algorithm; and (d) a step of inputting each vector value into a machine learning model to determine whether the website is a harmful website. The machine learning model is a logistic regression model; when its output data is a value between 0 and 1, the model predicts from that output the probability that the website is a harmful website.
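
Steps (b) to (d) of the summary can be illustrated with a minimal Python sketch. It is not the patented implementation: the record does not name the "preset algorithm" for vectorization, so a hashed bag-of-words is assumed here, and the logistic regression is fitted with plain stochastic gradient descent. All names (`tokenize`, `vectorize`, `train`, `predict_proba`, `DIM`) and the four toy pages with their labels are hypothetical; step (a), accessing the website, is replaced by inline HTML strings.

```python
import math
import re
import zlib
from html.parser import HTMLParser

DIM = 2 ** 16  # hypothetical size of the hashed feature space


class _TokenCollector(HTMLParser):
    """Collects tag names, attribute values and visible text as tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        for _, value in attrs:
            if value:
                self.tokens.extend(re.findall(r"[a-z0-9]+", value.lower()))

    def handle_data(self, data):
        self.tokens.extend(re.findall(r"[a-z0-9]+", data.lower()))


def tokenize(html):
    """Step (b): preprocess the HTML source and split it into tokens."""
    parser = _TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def vectorize(tokens, dim=DIM):
    """Step (c): sparse hashed bag-of-words, L2-normalised (assumed algorithm)."""
    counts = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % dim
        counts[idx] = counts.get(idx, 0.0) + 1.0
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {i: v / norm for i, v in counts.items()}


def predict_proba(weights, bias, vec):
    """Step (d): logistic regression output, a value between 0 and 1."""
    z = bias + sum(weights.get(i, 0.0) * x for i, x in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=500, lr=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(samples, labels):
            err = predict_proba(weights, bias, vec) - y
            for i, x in vec.items():
                weights[i] = weights.get(i, 0.0) - lr * err * x
            bias -= lr * err
    return weights, bias


# Invented toy pages standing in for step (a); label 1 = harmful, 0 = benign.
PAGES = [
    ("<html><body><a href='/bet'>casino jackpot betting bonus</a></body></html>", 1),
    ("<html><body><p>free casino spins jackpot</p></body></html>", 1),
    ("<html><body><h1>city library opening hours</h1></body></html>", 0),
    ("<html><body><p>weather forecast and local news</p></body></html>", 0),
]

X = [vectorize(tokenize(html)) for html, _ in PAGES]
y = [label for _, label in PAGES]
weights, bias = train(X, y)

new_page = "<html><body><p>casino jackpot bonus</p></body></html>"
p = predict_proba(weights, bias, vectorize(tokenize(new_page)))
print(f"P(harmful) = {p:.3f}")
```

The printed value is the model's estimate of the probability that the page is harmful; a deployment would compare it against a chosen threshold (0.5 is a common default, though the record does not specify one).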