DongTing: A Large-scale Dataset for Anomaly Detection of the Linux Kernel


Bibliographic details
Main authors:	GuoYun Duan, Yuanzhi Fu, Minjie Cai, Hao Chen, Jianhua Sun
Format:	Dataset
Language:	English
Subjects:	Linux Kernel, Anomaly Detection, Dataset, System Call, Machine Learning

Description
Summary:	DongTing is the first large-scale dataset dedicated to Linux kernel anomaly detection. It covers Linux kernels released in the last five years and includes a total of 18,966 well-labeled normal and attack sequences. The entire dataset is 85 GB in size after decompression. The attack data covers 26 major kernel releases and contains a total of 12,116 system call sequences collected from running 17,855 bug-triggering programs. The normal data comes from 6,850 normal programs in four kernel regression test suites. We maintain the dataset on Zenodo and the source code on GitHub, and back up both on Baidu Netdisk.

Dataset

The dataset is stored at http://doi.org/10.5281/zenodo.6627050. The data includes `abnormal_data`, `normal_data`, `models`, `npz` and baseline data, with a total volume of nearly 87 GB (of which the abnormal and normal data account for 85 GB after decompression). The `Abnormal_data` directory contains 12,116 files of system call sequences for 26 kernel releases, and the `Normal_data` directory contains 6,850 files of system call sequences collected from four regression test suites. All of these are raw sequences.

CNN/RNN, LSTM, and WaveNet machine learning models (three sets of hyperparameters per model) were selected for the evaluation of DongTing (DT), along with the ECOD model (which has no hyperparameters). DT_abnormal, DT_normal, ADFA-LD, and PLAID are each used for training. The results of training on DT are stored in the directory `Models-DongTing`, and the results of training on ADFA-LD and PLAID are stored in the directory `Models-Comparison`.

The directory `npz` stores the encoded datasets of DongTing, ADFA-LD, and PLAID (sequence length varies from 8 to 4495), encoded according to syscall_64.tbl in Linux kernel 5.17, including the training set, validation set, and test set.
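
The encoding step mentioned above, which maps system call names to the numbers in syscall_64.tbl, might look roughly like the sketch below. This is not the authors' implementation. It assumes that a raw sequence is a single string of syscall names separated by `|` (an assumption about the file layout, not confirmed by this record), and the helper names `parse_syscall_table` and `encode_sequence` are hypothetical. The table excerpt follows the kernel's `<number> <abi> <name> <entry point>` layout; x32 entries are skipped so that they do not overwrite the 64-bit numbers of same-named calls.

```python
# A few lines in the layout of arch/x86/entry/syscalls/syscall_64.tbl.
# A real run would read the full file from a kernel source tree instead.
TABLE_EXCERPT = """\
# <number> <abi> <name> <entry point>
0	common	read			sys_read
1	common	write			sys_write
2	common	open			sys_open
3	common	close			sys_close
13	64	rt_sigaction		sys_rt_sigaction
512	x32	rt_sigaction		compat_sys_rt_sigaction
"""


def parse_syscall_table(text):
    """Return a dict mapping syscall name -> number (hypothetical helper)."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a table row
        number, abi, name = fields[0], fields[1], fields[2]
        if abi == "x32":
            continue  # keep only the numbers used by native 64-bit programs
        table.setdefault(name, int(number))
    return table


def encode_sequence(raw, table, sep="|", unknown=-1):
    """Encode a separator-delimited string of syscall names as integers.

    The "|" separator and the -1 placeholder for unknown names are
    assumptions made for this sketch.
    """
    names = [name for name in raw.strip().split(sep) if name]
    return [table.get(name, unknown) for name in names]


table = parse_syscall_table(TABLE_EXCERPT)
encoded = encode_sequence("open|read|rt_sigaction|write|close", table)
```

Names missing from the table are mapped to a placeholder rather than dropped, so the encoded sequence keeps the length of the raw one.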

The file `Baseline.xlsx` contains all the information about the DongTing dataset and can be used when training machine learning models. For example, the whole dataset is randomly divided into training, validation, and test sets with a ratio of 80%:10%:10%. The implementation of the dataset division can be found in the source code.

Source Code

The source code for dataset development is stored at https://github.com/HNUSystemsLab/DongTing; a brief introduction follows. The source code contains three folders, `Source Code Files`, `Documents`, and `DB`, where `Documents` stores the detailed...
DOI:	10.5281/zenodo.6627049
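
The random 80%:10%:10% division described in the summary could be reproduced along the lines of the sketch below. The authors' own division code lives in their repository and may differ. The function name `split_dataset`, the fixed seed, and the use of integer indices as stand-ins for the 18,966 sequence files are all assumptions made for illustration.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and cut them into training, validation, and test lists.

    Hypothetical helper: the seed and the rounding rule (truncate the first
    two parts, give the remainder to the test set) are choices made for this
    sketch, not taken from the DongTing source code.
    """
    items = list(items)
    rng = random.Random(seed)  # local generator, so the split is repeatable
    rng.shuffle(items)

    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains after the first two cuts
    return train, val, test


# Indices stand in for the 18,966 labeled sequences in the dataset.
train, val, test = split_dataset(range(18966))
```

Because the data contains far more attack sequences than a naive split might balance, a stratified split by label or kernel release may be preferable in practice; the sketch above performs only the plain random division.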