Natural Language to Code Generation in Interactive Data Science Notebooks

Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural la...

Bibliographic details
Main authors:	Yin, Pengcheng ; Li, Wen-Ding ; Xiao, Kefan ; Rao, Abhishek ; Wen, Yeming ; Shi, Kensen ; Howland, Joshua ; Bailey, Paige ; Catasta, Michele ; Michalewski, Henryk ; Polozov, Alex ; Sutton, Charles
Format:	Article
Language:	eng
Subjects:	Computer Science - Computation and Language ; Computer Science - Software Engineering
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Yin, Pengcheng ; Li, Wen-Ding ; Xiao, Kefan ; Rao, Abhishek ; Wen, Yeming ; Shi, Kensen ; Howland, Joshua ; Bailey, Paige ; Catasta, Michele ; Michalewski, Henryk ; Polozov, Alex ; Sutton, Charles
description	Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.
doi_str_mv	10.48550/arxiv.2212.09248
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2212_09248</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2212_09248</sourcerecordid><originalsourceid>FETCH-LOGICAL-a678-5729ed00913cc228e6df72a34e030ffce828b6bd6c098783ac271c8fa0e6b5383</originalsourceid><addsrcrecordid>eNotz7FOwzAUhWEvDKjwAEz4BRKc68S5GVGAEikqA92jG-e6sig2ct0K3h4oTEf_cqRPiJtKlTU2jbqj9OlPJUAFpeqgxksxbCgfE-3lSGF3pB3LHGUfF5ZrDpwo-xikD3II-ads9ieWD5RJvlrPwbLcxMxzjG-HK3HhaH_g6_9die3T47Z_LsaX9dDfjwWZFoumhY4XpbpKWwuAbBbXAumalVbOWUbA2cyLsarDFjVZaCuLjhSbudGoV-L27_ZsmT6Sf6f0Nf2aprNJfwOdIUaa</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Natural Language to Code Generation in Interactive Data Science Notebooks</title><source>arXiv.org</source><creator>Yin, Pengcheng ; Li, Wen-Ding ; Xiao, Kefan ; Rao, Abhishek ; Wen, Yeming ; Shi, Kensen ; Howland, Joshua ; Bailey, Paige ; Catasta, Michele ; Michalewski, Henryk ; Polozov, Alex ; Sutton, Charles</creator><creatorcontrib>Yin, Pengcheng ; Li, Wen-Ding ; Xiao, Kefan ; Rao, Abhishek ; Wen, Yeming ; Shi, Kensen ; Howland, Joshua ; Bailey, Paige ; Catasta, Michele ; Michalewski, Henryk ; Polozov, Alex ; Sutton, Charles</creatorcontrib><description>Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. 
It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.</description><identifier>DOI: 10.48550/arxiv.2212.09248</identifier><language>eng</language><subject>Computer Science - Computation and Language ; Computer Science - Software Engineering</subject><creationdate>2022-12</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2212.09248$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2212.09248$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Yin, Pengcheng</creatorcontrib><creatorcontrib>Li, Wen-Ding</creatorcontrib><creatorcontrib>Xiao, Kefan</creatorcontrib><creatorcontrib>Rao, Abhishek</creatorcontrib><creatorcontrib>Wen, Yeming</creatorcontrib><creatorcontrib>Shi, Kensen</creatorcontrib><creatorcontrib>Howland, Joshua</creatorcontrib><creatorcontrib>Bailey, Paige</creatorcontrib><creatorcontrib>Catasta, Michele</creatorcontrib><creatorcontrib>Michalewski, Henryk</creatorcontrib><creatorcontrib>Polozov, Alex</creatorcontrib><creatorcontrib>Sutton, Charles</creatorcontrib><title>Natural 
Language to Code Generation in Interactive Data Science Notebooks</title><description>Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.</description><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Software Engineering</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz7FOwzAUhWEvDKjwAEz4BRKc68S5GVGAEikqA92jG-e6sig2ct0K3h4oTEf_cqRPiJtKlTU2jbqj9OlPJUAFpeqgxksxbCgfE-3lSGF3pB3LHGUfF5ZrDpwo-xikD3II-ads9ieWD5RJvlrPwbLcxMxzjG-HK3HhaH_g6_9die3T47Z_LsaX9dDfjwWZFoumhY4XpbpKWwuAbBbXAumalVbOWUbA2cyLsarDFjVZaCuLjhSbudGoV-L27_ZsmT6Sf6f0Nf2aprNJfwOdIUaa</recordid><startdate>20221219</startdate><enddate>20221219</enddate><creator>Yin, Pengcheng</creator><creator>Li, Wen-Ding</creator><creator>Xiao, Kefan</creator><creator>Rao, Abhishek</creator><creator>Wen, Yeming</creator><creator>Shi, 
Kensen</creator><creator>Howland, Joshua</creator><creator>Bailey, Paige</creator><creator>Catasta, Michele</creator><creator>Michalewski, Henryk</creator><creator>Polozov, Alex</creator><creator>Sutton, Charles</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20221219</creationdate><title>Natural Language to Code Generation in Interactive Data Science Notebooks</title><author>Yin, Pengcheng ; Li, Wen-Ding ; Xiao, Kefan ; Rao, Abhishek ; Wen, Yeming ; Shi, Kensen ; Howland, Joshua ; Bailey, Paige ; Catasta, Michele ; Michalewski, Henryk ; Polozov, Alex ; Sutton, Charles</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a678-5729ed00913cc228e6df72a34e030ffce828b6bd6c098783ac271c8fa0e6b5383</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Software Engineering</topic><toplevel>online_resources</toplevel><creatorcontrib>Yin, Pengcheng</creatorcontrib><creatorcontrib>Li, Wen-Ding</creatorcontrib><creatorcontrib>Xiao, Kefan</creatorcontrib><creatorcontrib>Rao, Abhishek</creatorcontrib><creatorcontrib>Wen, Yeming</creatorcontrib><creatorcontrib>Shi, Kensen</creatorcontrib><creatorcontrib>Howland, Joshua</creatorcontrib><creatorcontrib>Bailey, Paige</creatorcontrib><creatorcontrib>Catasta, Michele</creatorcontrib><creatorcontrib>Michalewski, Henryk</creatorcontrib><creatorcontrib>Polozov, Alex</creatorcontrib><creatorcontrib>Sutton, Charles</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Yin, Pengcheng</au><au>Li, Wen-Ding</au><au>Xiao, Kefan</au><au>Rao, Abhishek</au><au>Wen, Yeming</au><au>Shi, Kensen</au><au>Howland, Joshua</au><au>Bailey, Paige</au><au>Catasta, 
Michele</au><au>Michalewski, Henryk</au><au>Polozov, Alex</au><au>Sutton, Charles</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Natural Language to Code Generation in Interactive Data Science Notebooks</atitle><date>2022-12-19</date><risdate>2022</risdate><abstract>Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.</abstract><doi>10.48550/arxiv.2212.09248</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2212.09248
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2212_09248
source	arXiv.org
subjects	Computer Science - Computation and Language ; Computer Science - Software Engineering
title	Natural Language to Code Generation in Interactive Data Science Notebooks
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T07%3A04%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Natural%20Language%20to%20Code%20Generation%20in%20Interactive%20Data%20Science%20Notebooks&rft.au=Yin,%20Pengcheng&rft.date=2022-12-19&rft_id=info:doi/10.48550/arxiv.2212.09248&rft_dat=%3Carxiv_GOX%3E2212_09248%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true