CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics
Main authors: | |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Accurate identification of software vulnerabilities is crucial for system
integrity. Vulnerability datasets, often derived from the National
Vulnerability Database (NVD) or directly from GitHub, are essential for
training machine learning models to detect these security flaws. However, these
datasets frequently suffer from significant noise, typically 40% to 75%, due
primarily to the automatic and indiscriminate labeling of all changes in
vulnerability-fixing commits (VFCs) as vulnerability-related. This
misclassification occurs because not all changes in a commit aimed at fixing
vulnerabilities pertain to security threats; many are routine updates like bug
fixes or test improvements.
This paper introduces VulSifter, the first methodology that uses a Large
Language Model (LLM) with a heuristic enhancement to automatically identify
vulnerability-fixing changes from VFCs, achieving an F1-score of 0.82.
VulSifter uses an LLM to comprehend code semantics and contextual
information, and applies heuristics to filter out unrelated changes. We
applied VulSifter in a large-scale study in which we crawled 127,063
repositories on GitHub and collected 5,352,105 commits. We then developed
CleanVul, a high-quality dataset comprising 11,632 functions built with this
LLM heuristic enhancement approach, which demonstrates Correctness (90.6%)
comparable to established datasets such as SVEN and PrimeVul.
To evaluate the CleanVul dataset, we conducted experiments focusing on
fine-tuning various LLMs on CleanVul and other high-quality datasets.
Evaluation results reveal that LLMs fine-tuned on CleanVul not only exhibit
enhanced accuracy but also superior generalization capabilities compared to
those trained on uncleaned datasets. Specifically, models trained on CleanVul
and tested on PrimeVul achieve accuracy higher than those trained and tested
exclusively on PrimeVul. |
---|---|
DOI: | 10.48550/arxiv.2411.17274 |
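
The summary describes a two-part filter: heuristics discard changes that are unlikely to be security-related, and an LLM judges the remaining ones. The following is only a minimal sketch of that idea, not the paper's implementation. `FunctionChange`, `looks_unrelated`, `sift`, the 0-to-1 score range, and the 0.5 threshold are all hypothetical, and `keyword_scorer` is a toy stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FunctionChange:
    """One changed function inside a vulnerability-fixing commit (hypothetical record)."""
    file_path: str
    before: str
    after: str


def looks_unrelated(change: FunctionChange) -> bool:
    """Heuristic stand-in: treat changes under test paths as not vulnerability-related."""
    return "test" in change.file_path.lower()


def sift(
    changes: List[FunctionChange],
    llm_score: Callable[[FunctionChange], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[FunctionChange]:
    """Keep changes that pass the heuristic and that the scorer rates at or above threshold."""
    kept = []
    for change in changes:
        if looks_unrelated(change):
            continue  # dropped by the heuristic, the scorer is never called
        if llm_score(change) >= threshold:
            kept.append(change)
    return kept


def keyword_scorer(change: FunctionChange) -> float:
    """Toy stand-in for an LLM call: rewards a newly added length check."""
    added_check = "len(" in change.after and "len(" not in change.before
    return 1.0 if added_check else 0.0


example_changes = [
    FunctionChange(
        "src/parser.py",
        "def read(buf, n):\n    return buf[:n]",
        "def read(buf, n):\n    if n > len(buf):\n        raise ValueError\n    return buf[:n]",
    ),
    FunctionChange(
        "tests/test_parser.py",
        "def test_read():\n    pass",
        "def test_read():\n    assert len(read(b'ab', 1)) == 1",
    ),
    FunctionChange("src/version.py", 'VERSION = "1.0"', 'VERSION = "1.1"'),
]

kept_paths = [c.file_path for c in sift(example_changes, keyword_scorer)]
print(kept_paths)
```

In this toy run only the change in `src/parser.py` survives: the test-file change is removed by the heuristic, and the version bump is rejected by the scorer. Passing the scorer in as a callable keeps the heuristic stage independent of whichever model is used.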