On Generating and Labeling Network Traffic with Realistic, Self-Propagating Malware
Main authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Research and development of techniques which detect or remediate malicious
network activity require access to diverse, realistic, contemporary data sets
containing labeled malicious connections. In the absence of such data, said
techniques cannot be meaningfully trained, tested, and evaluated. Synthetically
produced data containing fabricated or merged network traffic is of limited
value as it is easily distinguishable from real traffic by even simple machine
learning (ML) algorithms. Real network data is preferable but, while ubiquitous,
is broadly both sensitive and lacking in ground-truth labels, limiting its
utility for ML research.
This paper presents a multi-faceted approach to generating a data set of
labeled malicious connections embedded within anonymized network traffic
collected from large production networks. Real-world malware is defanged and
introduced to simulated, secured nodes within those networks to generate
realistic traffic while maintaining sufficient isolation to protect real data
and infrastructure. Network sensor data, including this embedded malware
traffic, is collected at a network edge and anonymized for research use.
Network traffic was collected and produced in accordance with the
aforementioned methods at two major educational institutions. The result is a
highly realistic, long-term, multi-institution data set with embedded data
labels spanning over 1.5 trillion connections and over a petabyte of sensor log
data. The usability of this data set is demonstrated by its utility to our
artificial intelligence and machine learning (AI/ML) research program. |
---|---|
DOI: | 10.48550/arxiv.2104.10034 |