Exploring the Robustness of Large Language Models for Solving Programming Problems

Bibliographic details
Main authors: Shirafuji, Atsushi; Watanobe, Yutaka; Ito, Takumi; Morishita, Makoto; Nakamura, Yuki; Oda, Yusuke; Suzuki, Jun
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Software Engineering
Online access: full text via arXiv (https://arxiv.org/abs/2306.14583)
Description: Using large language models (LLMs) for source code has recently gained attention. Transformer-based LLMs such as Codex and ChatGPT have been shown to be highly capable of solving a wide range of programming problems. However, it has not yet been established whether LLMs understand problem descriptions and generate programs accordingly, or merely retrieve source code for the most relevant problem in their training data based on superficial cues. To explore this research question, we conduct experiments to assess the robustness of several popular LLMs (the CodeGen and GPT-3.5 series models) on code generation tasks for introductory programming problems. Our experimental results show that CodeGen and Codex are sensitive to superficial modifications of problem descriptions, which significantly affect code generation performance. Furthermore, we observe that Codex relies on variable names, as randomizing variable names significantly decreases the solved rate. However, state-of-the-art (SOTA) models such as InstructGPT and ChatGPT show higher robustness to superficial modifications and an outstanding capability for solving programming problems. These findings highlight that slight modifications to the prompts given to LLMs can greatly affect code generation performance, and that careful prompt formatting is essential for high-quality code generation, even as SOTA models become more robust to perturbations.
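
The abstract's key manipulation (randomizing variable names in a problem description to test whether a model relies on superficial cues) can be illustrated with a short Python sketch. This is an illustration only, not the authors' implementation; the function name randomize_variable_names and the toy problem below are invented for the example.

    import random
    import re
    import string

    def randomize_variable_names(description: str, variables: list[str]) -> str:
        """Replace each listed variable in a problem description with a random
        identifier: a superficial change that leaves the task's meaning intact."""
        perturbed = description
        for name in variables:
            # Draw a fresh random identifier such as "xqzt".
            random_name = "".join(random.choices(string.ascii_lowercase, k=4))
            # Substitute whole-word occurrences only, so "n" does not match "and".
            perturbed = re.sub(rf"\b{re.escape(name)}\b", random_name, perturbed)
        return perturbed

    # Example: perturb a toy problem statement before prompting a code model,
    # then compare solved rates on original vs. perturbed descriptions.
    problem = "Read an integer n and print the sum of the integers from 1 to n."
    print(randomize_variable_names(problem, ["n"]))

Under the paper's framing, a significant drop in solved rate on such perturbed descriptions would indicate reliance on variable names rather than genuine problem understanding.
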
DOI: 10.48550/arxiv.2306.14583
Date: 2023-06-26
Rights: http://creativecommons.org/licenses/by/4.0
Source: arXiv.org