Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection

Bibliographic details
Published in:	IEEE Transactions on Information Forensics and Security, 2021, Vol. 16, pp. 1943-1958
Main authors:	Wang, Huanting; Ye, Guixin; Tang, Zhanyong; Tan, Shin Hwei; Huang, Songfang; Fang, Dingyi; Feng, Yansong; Bian, Lizhong; Wang, Zheng
Format:	Article
Language:	English
Subjects:	code vulnerability detection; Computer bugs; Data collection; Data models; deep graph neural networks; deep learning; Evaluation; Graph neural networks; Graph representations; Graphical representations; Predictive models; Semantics; Software; Software reliability; Software vulnerability; Source code; Statistical analysis; Statistical methods; Syntactics; Training; Training data

Description
Abstract:	This paper presents FUNDED (Flow-sensitive vUl-Nerability coDE Detection), a novel learning framework for building vulnerability detection models. FUNDED leverages advances in graph neural networks (GNNs) to develop a novel graph-based learning method that captures and reasons about the program's control, data, and call dependencies. Unlike prior work that treats the program as a sequence or an untyped graph, FUNDED learns and operates on a graph representation of the program source code, in which individual statements are connected to other statements through relational edges. By capturing the program syntax, semantics, and flows, FUNDED finds a better code representation for the downstream software vulnerability detection task. To provide sufficient training data to build an effective deep learning model, we combine probabilistic learning and statistical assessments to automatically gather high-quality training samples from open-source projects. This provides many real-life vulnerable code training samples to complement the limited vulnerable code samples available in standard vulnerability databases. We apply FUNDED to identify software vulnerabilities at the function level from program source code. We evaluate FUNDED on large real-world datasets with programs written in C, Java, Swift, and PHP, and compare it against six state-of-the-art code vulnerability detection models. Experimental results show that FUNDED significantly outperforms alternative approaches across evaluation settings.
ISSN:	1556-6013, 1556-6021
DOI:	10.1109/TIFS.2020.3044773