HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data

Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that many LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by tr...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Hajipour, Hossein, Schönherr, Lea, Holz, Thorsten, Fritz, Mario
Format:	Artikel
Sprache:	eng
Schlagworte:	Computer Science - Artificial Intelligence Computer Science - Computation and Language Computer Science - Cryptography and Security Computer Science - Learning Computer Science - Software Engineering
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Hajipour, Hossein Schönherr, Lea Holz, Thorsten Fritz, Mario
description	Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that many LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by training models that generate secure code, these attempts remain constrained by limited access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes by automatically synthesizing secure codes, which reduces the effort of finding suitable training data. HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation. The data synthesis pipeline generates pairs of vulnerable and fixed codes for specific Common Weakness Enumeration (CWE) types by utilizing a state-of-the-art LLM for repairing vulnerable code. A security oracle identifies vulnerabilities, and a state-of-the-art LLM repairs them by extending and/or editing the codes, creating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA) method. Each example of our fine-tuning dataset includes the necessary security-related libraries and code that form the basis of our novel two-step generation approach. This allows the model to integrate security-relevant libraries before generating the main code, significantly reducing the number of generated vulnerable codes by up to 85% compared to the baseline methods. We perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness.
doi_str_mv	10.48550/arxiv.2409.06446
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2409_06446</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2409_06446</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2409_064463</originalsourceid><addsrcrecordid>eNqFzcEKgkAQxvG9dIjqATo1L6BttUp1tVLo0EHvMuhUA7bGtIq-fSjdO3384YOfUsuN9s0-CPQapePW3xp98HVoTDhV14Q6jOqS5AgpFY0QDAUxWRJ0XFtoGeEmWFTkxQ2XVELaW_ckxwVkgmzZPuCEDudqcsfqQ4vfztTqcs6ixBvZ_C38Qunzgc9Hfvf_8QVNvToJ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data</title><source>arXiv.org</source><creator>Hajipour, Hossein ; Schönherr, Lea ; Holz, Thorsten ; Fritz, Mario</creator><creatorcontrib>Hajipour, Hossein ; Schönherr, Lea ; Holz, Thorsten ; Fritz, Mario</creatorcontrib><description>Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that many LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by training models that generate secure code, these attempts remain constrained by limited access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes by automatically synthesizing secure codes, which reduces the effort of finding suitable training data. HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation. The data synthesis pipeline generates pairs of vulnerable and fixed codes for specific Common Weakness Enumeration (CWE) types by utilizing a state-of-the-art LLM for repairing vulnerable code. A security oracle identifies vulnerabilities, and a state-of-the-art LLM repairs them by extending and/or editing the codes, creating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA) method. Each example of our fine-tuning dataset includes the necessary security-related libraries and code that form the basis of our novel two-step generation approach. This allows the model to integrate security-relevant libraries before generating the main code, significantly reducing the number of generated vulnerable codes by up to 85% compared to the baseline methods. We perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness.</description><identifier>DOI: 10.48550/arxiv.2409.06446</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computation and Language ; Computer Science - Cryptography and Security ; Computer Science - Learning ; Computer Science - Software Engineering</subject><creationdate>2024-09</creationdate><rights>http://creativecommons.org/licenses/by-nc-sa/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2409.06446$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2409.06446$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Hajipour, Hossein</creatorcontrib><creatorcontrib>Schönherr, Lea</creatorcontrib><creatorcontrib>Holz, Thorsten</creatorcontrib><creatorcontrib>Fritz, Mario</creatorcontrib><title>HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data</title><description>Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that many LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by training models that generate secure code, these attempts remain constrained by limited access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes by automatically synthesizing secure codes, which reduces the effort of finding suitable training data. HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation. The data synthesis pipeline generates pairs of vulnerable and fixed codes for specific Common Weakness Enumeration (CWE) types by utilizing a state-of-the-art LLM for repairing vulnerable code. A security oracle identifies vulnerabilities, and a state-of-the-art LLM repairs them by extending and/or editing the codes, creating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA) method. Each example of our fine-tuning dataset includes the necessary security-related libraries and code that form the basis of our novel two-step generation approach. This allows the model to integrate security-relevant libraries before generating the main code, significantly reducing the number of generated vulnerable codes by up to 85% compared to the baseline methods. We perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Cryptography and Security</subject><subject>Computer Science - Learning</subject><subject>Computer Science - Software Engineering</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNqFzcEKgkAQxvG9dIjqATo1L6BttUp1tVLo0EHvMuhUA7bGtIq-fSjdO3384YOfUsuN9s0-CPQapePW3xp98HVoTDhV14Q6jOqS5AgpFY0QDAUxWRJ0XFtoGeEmWFTkxQ2XVELaW_ckxwVkgmzZPuCEDudqcsfqQ4vfztTqcs6ixBvZ_C38Qunzgc9Hfvf_8QVNvToJ</recordid><startdate>20240910</startdate><enddate>20240910</enddate><creator>Hajipour, Hossein</creator><creator>Schönherr, Lea</creator><creator>Holz, Thorsten</creator><creator>Fritz, Mario</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240910</creationdate><title>HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data</title><author>Hajipour, Hossein ; Schönherr, Lea ; Holz, Thorsten ; Fritz, Mario</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2409_064463</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Cryptography and Security</topic><topic>Computer Science - Learning</topic><topic>Computer Science - Software Engineering</topic><toplevel>online_resources</toplevel><creatorcontrib>Hajipour, Hossein</creatorcontrib><creatorcontrib>Schönherr, Lea</creatorcontrib><creatorcontrib>Holz, Thorsten</creatorcontrib><creatorcontrib>Fritz, Mario</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Hajipour, Hossein</au><au>Schönherr, Lea</au><au>Holz, Thorsten</au><au>Fritz, Mario</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data</atitle><date>2024-09-10</date><risdate>2024</risdate><abstract>Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that many LLM-generated code contains serious security vulnerabilities. While previous work tries to address this by training models that generate secure code, these attempts remain constrained by limited access to training data and labor-intensive data preparation. In this paper, we introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes by automatically synthesizing secure codes, which reduces the effort of finding suitable training data. HexaCoder comprises two key components: an oracle-guided data synthesis pipeline and a two-step process for secure code generation. The data synthesis pipeline generates pairs of vulnerable and fixed codes for specific Common Weakness Enumeration (CWE) types by utilizing a state-of-the-art LLM for repairing vulnerable code. A security oracle identifies vulnerabilities, and a state-of-the-art LLM repairs them by extending and/or editing the codes, creating data pairs for fine-tuning using the Low-Rank Adaptation (LoRA) method. Each example of our fine-tuning dataset includes the necessary security-related libraries and code that form the basis of our novel two-step generation approach. This allows the model to integrate security-relevant libraries before generating the main code, significantly reducing the number of generated vulnerable codes by up to 85% compared to the baseline methods. We perform extensive evaluations on three different benchmarks for four LLMs, demonstrating that HexaCoder not only improves the security of the generated code but also maintains a high level of functional correctness.</abstract><doi>10.48550/arxiv.2409.06446</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2409.06446
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2409_06446
source	arXiv.org
subjects	Computer Science - Artificial Intelligence Computer Science - Computation and Language Computer Science - Cryptography and Security Computer Science - Learning Computer Science - Software Engineering
title	HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-02T14%3A42%3A06IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=HexaCoder:%20Secure%20Code%20Generation%20via%20Oracle-Guided%20Synthetic%20Training%20Data&rft.au=Hajipour,%20Hossein&rft.date=2024-09-10&rft_id=info:doi/10.48550/arxiv.2409.06446&rft_dat=%3Carxiv_GOX%3E2409_06446%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true