RedCode: Risky Code Execution and Generation Benchmark for Code Agents

With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT-4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode.
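The abstract outlines the RedCode-Exec evaluation flow at a high level: an agent's response to a risky prompt is checked for rejection, and any code it produces is run inside Docker environments whose execution results feed the evaluation metrics. The sketch below is a minimal, hypothetical illustration of that flow, not the official RedCode harness: the refusal keywords, the python:3.11-slim image, and the use of exit status as the outcome signal are placeholder assumptions; the benchmark's actual per-case metrics are defined in the linked repository.

```python
import subprocess

# Placeholder refusal heuristics (an assumption); the real benchmark defines its own metrics.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "not able to assist")


def is_rejection(agent_response: str) -> bool:
    """Rough check for whether the agent declined the risky request."""
    lowered = agent_response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def run_in_sandbox(code: str, image: str = "python:3.11-slim", timeout: int = 30):
    """Execute agent-produced Python inside a throwaway container (--rm) so any
    side effects stay confined to the sandbox. Requires a local Docker daemon."""
    return subprocess.run(
        ["docker", "run", "--rm", image, "python", "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )


def evaluate_case(agent_response: str, extracted_code=None) -> str:
    """Map one test case to a coarse outcome label (rejection / failure / success)."""
    if extracted_code is None or is_rejection(agent_response):
        return "rejected"
    result = run_in_sandbox(extracted_code)
    # The real evaluator checks case-specific evidence of the risky operation
    # (e.g., a file deleted or a process started); exit status is a crude stand-in.
    return "attack_succeeded" if result.returncode == 0 else "execution_failed"


if __name__ == "__main__":
    # Benign demo: a refusal response with no produced code counts as a rejection.
    print(evaluate_case("I can't help with running that command."))
```

In the same spirit, a RedCode-Gen case would pass a function signature and docstring to the agent and score the returned code for harmfulness rather than executing it.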

Bibliographic details
Published in: arXiv.org, 2024-11
Main authors: Guo, Chengquan; Liu, Xun; Xie, Chulin; Zhou, Andy; Zeng, Yi; Lin, Zinan; Song, Dawn; Li, Bo
Format: Article
Language: English
Subjects: Benchmarks; Coding; Rejection rate; Software
Online access: Full text
EISSN: 2331-8422
Source: Freely Accessible Journals