Exploring the Robustness of Large Language Models for Solving Programming Problems

Bibliographic details
Main authors: Shirafuji, Atsushi; Watanobe, Yutaka; Ito, Takumi; Morishita, Makoto; Nakamura, Yuki; Oda, Yusuke; Suzuki, Jun
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Software Engineering
Online access: full text via arXiv (https://arxiv.org/abs/2306.14583)
Description: Using large language models (LLMs) for source code has recently gained attention. Transformer-based LLMs such as Codex and ChatGPT have been shown to be highly capable of solving a wide range of programming problems. However, it has not yet been established whether LLMs understand problem descriptions and generate programs accordingly, or merely retrieve source code for the most relevant problem in their training data based on superficial cues. To explore this research question, we conduct experiments to assess the robustness of several popular LLMs (the CodeGen and GPT-3.5 series models) on code generation tasks for introductory programming problems. Our experimental results show that CodeGen and Codex are sensitive to superficial modifications of problem descriptions, which significantly affect code generation performance. Furthermore, we observe that Codex relies on variable names, as randomizing variable names significantly decreases the solved rate. However, state-of-the-art (SOTA) models such as InstructGPT and ChatGPT show higher robustness to superficial modifications and an outstanding capability for solving programming problems. These findings highlight that slight modifications to the prompts given to LLMs can greatly affect code generation performance, and that careful prompt formatting is essential for high-quality code generation, even as SOTA models become more robust to perturbations.
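
The abstract's key manipulation (randomizing variable names in a problem description to test whether a model relies on superficial cues) can be illustrated with a short Python sketch. This is an illustration only, not the authors' implementation; the function name randomize_variable_names and the toy problem below are invented for the example.

    import random
    import re
    import string

    def randomize_variable_names(description: str, variables: list[str]) -> str:
        """Replace each listed variable in a problem description with a random
        identifier: a superficial change that leaves the task's meaning intact."""
        perturbed = description
        for name in variables:
            # Draw a fresh random identifier such as "xqzt".
            random_name = "".join(random.choices(string.ascii_lowercase, k=4))
            # Substitute whole-word occurrences only, so "n" does not match "and".
            perturbed = re.sub(rf"\b{re.escape(name)}\b", random_name, perturbed)
        return perturbed

    # Example: perturb a toy problem statement before prompting a code model,
    # then compare solved rates on original vs. perturbed descriptions.
    problem = "Read an integer n and print the sum of the integers from 1 to n."
    print(randomize_variable_names(problem, ["n"]))

Under the paper's framing, a significant drop in solved rate on such perturbed descriptions would indicate reliance on variable names rather than genuine problem understanding.
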
DOI: 10.48550/arxiv.2306.14583
Date: 2023-06-26
Rights: http://creativecommons.org/licenses/by/4.0
Source: arXiv.org