Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Security experts reverse engineer (decompile) binary code to identify
critical security vulnerabilities. The limited access to source code in vital
systems - such as firmware, drivers, and proprietary software used in Critical
Infrastructures (CI) - makes this analysis even more crucial on the binary
level. Even with available source code, a semantic gap persists after
compilation between the source and the binary code executed by the processor.
This gap may hinder the detection of vulnerabilities in source code. That being
said, current research on Large Language Models (LLMs) overlooks the
significance of decompiled binaries in this area by focusing solely on source
code. In this work, we are the first to empirically uncover the substantial
semantic limitations of state-of-the-art LLMs when it comes to analyzing
vulnerabilities in decompiled binaries, largely due to the absence of relevant
datasets. To bridge the gap, we introduce DeBinVul, a novel decompiled binary
code vulnerability dataset. Our dataset is multi-architecture and
multi-optimization, focusing on C/C++ due to their wide usage in CI and
association with numerous vulnerabilities. Specifically, we curate 150,872
samples of vulnerable and non-vulnerable decompiled binary code for the task of
(i) identifying; (ii) classifying; (iii) describing vulnerabilities; and (iv)
recovering function names in the domain of decompiled binaries. Subsequently,
we fine-tune state-of-the-art LLMs using DeBinVul and report a performance
increase of 19%, 24%, and 21% in the capabilities of CodeLlama, Llama3, and
CodeGen2, respectively, in detecting binary code vulnerabilities. Additionally,
using DeBinVul, we report a high performance of 80-90% on the vulnerability
classification task. Furthermore, we report improved performance in function
name recovery and vulnerability description tasks. |
---|---|
DOI: | 10.48550/arxiv.2411.04981 |