L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
Published in: Transactions of the Association for Computational Linguistics, 2024-10, Vol. 12, pp. 1311-1329
Main authors: [not listed in this record]
Format: Article
Language: English
Online access: Full text
Abstract: Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs. Despite promising results, there is a notable lack of a comprehensive evaluation of these models’ language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning, and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition, we assess confidence calibration and conduct human evaluations to identify typical failures across different tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We release the evaluation framework and all model outputs, hoping to lay the groundwork for further research. All future evaluations (LLaMA-3, StarCoder2, etc.) will be updated on the project website.
ISSN: 2307-387X
DOI: 10.1162/tacl_a_00705