STIOCS: Active learning-based semi-supervised training framework for IOC extraction

Bibliographic details
Published in: Computers & electrical engineering 2023-12, Vol. 112, p. 108981, Article 108981
Main authors: Tang, Binhui; Li, Xiaohui; Wang, Junfeng; Ge, Wenhan; Yu, Zhongkun; Lin, Tongcan
Format: Article
Language: English
Description
Summary: Cyber Threat Intelligence (CTI) contains numerous Indicators of Compromise (IOCs) and contextual information, crucial for understanding threat actors’ behavior and intentions. However, current information extraction relies predominantly on supervised learning algorithms, which poses challenges in the field of CTI for two reasons. First, the scarcity of labeled data with IOCs hampers the effectiveness of supervised learning. Second, existing methods struggle to extract comprehensive contextual features, which makes IOC recognition within CTI difficult. To address these limitations and better suit the unique characteristics of CTI text, this paper introduces STIOCS, a semi-supervised framework that combines active learning and self-training for IOC extraction. STIOCS enhances IOC extraction accuracy and efficiency by leveraging limited labeled data and a rich unannotated corpus. First, the Active Learning (AL) approach uses the Density-based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to select reliable samples, reducing the noise that pseudo-labeling introduces during self-training. The extraction model integrates Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) algorithms to extract local and sequential features from CTI text, respectively. Convolutional kernels of different sizes then fuse the two types of features to enhance the semantic features. Finally, a Conditional Random Fields (CRF) layer is employed to recognize IOC entities. Our experimental results demonstrate the effectiveness and robustness of the proposed method in IOC extraction, even with limited labeled data. With only approximately 40% of the dataset labeled, the proposed method achieves better F1 scores than existing supervised methods, and its performance improves consistently as the dataset size increases.
STIOCS effectively suppresses weak label noise, reduces training costs, and enhances the recognition model’s performance. It provides a cost-effective training framework for entity extraction in cyber threat intelligence.
• This paper proposes a semi-supervised active learning framework, STIOCS, aimed at improving the efficiency of IOC extraction in CTI and addressing the challenge of model degradation due to inadequate IOC annotation data.
• Firstly, we extract valuable information from unlabeled data during self-training, but pseudo-labeling noise can potentially d
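The summary's idea of using DBSCAN to keep only reliable pseudo-labeled samples can be illustrated with a small sketch. This is not the authors' implementation: it assumes each unlabeled sample has already been mapped to a feature vector, and it assumes that "reliable" means "not marked as noise by DBSCAN". The helper names `dbscan` and `select_reliable`, the parameter values, and the toy data are all hypothetical.

```python
from math import dist

NOISE = -1  # cluster id used for outliers


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: one cluster id per point, NOISE for outliers."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still become a border point later
            continue
        labels[i] = cluster  # i is a core point and starts a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
        cluster += 1
    return labels


def select_reliable(embeddings, pseudo_labels, eps=0.5, min_samples=3):
    """Assumed rule: drop pseudo-labeled samples that DBSCAN marks as noise."""
    cluster_ids = dbscan(embeddings, eps, min_samples)
    return [
        (x, y)
        for x, y, c in zip(embeddings, pseudo_labels, cluster_ids)
        if c != NOISE
    ]


# Toy 2-D "embeddings": two dense groups and one isolated sample.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (9.0, 0.0),
]
pseudo_labels = ["IP", "IP", "IP", "IP", "DOMAIN", "DOMAIN", "DOMAIN", "HASH"]

cluster_ids = dbscan(embeddings, eps=0.5, min_samples=3)
kept = select_reliable(embeddings, pseudo_labels)
print(cluster_ids)
print(len(kept), "of", len(embeddings), "pseudo-labeled samples kept")
```

In a self-training loop, only the kept samples would be added to the training set for the next round; the isolated sample, whose pseudo-label has no nearby support, is held back.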
ISSN: 0045-7906, 1879-0755
DOI: 10.1016/j.compeleceng.2023.108981