ComPile: A Large IR Dataset from Production Sources
Saved in:
Main authors: |  |
---|---|
Format: | Article |
Language: | eng |
Subjects: |  |
Online access: | Order full text |
Abstract: | Code is increasingly becoming a core data modality of modern machine learning
research, impacting not only the way we write code with conversational agents
like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude, and the way we
translate code from one language into another, but also the compiler
infrastructure underlying the language. While modeling approaches may vary and
representations differ, the targeted tasks often remain the same within the
individual classes of models. Relying solely on the ability of modern models to
extract information from unstructured code ignores 70 years of programming
language and compiler development by not utilizing the structure inherent to
programs during data collection. This detracts from the performance of models
working over a tokenized representation of input code and precludes the use of
these models inside the compiler itself. To work towards the first intermediate
representation (IR) based models, we fully utilize the LLVM compiler
infrastructure, shared by a number of languages, to generate a 182B token
dataset of LLVM IR. We generated this dataset from programming languages built
on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++, by
hooking into LLVM code generation either through the language's package manager
or the compiler directly, extracting intermediate representations from
production-grade programs. Statistical analysis demonstrates the utility of our
dataset not only for large language model training, but also for introspection
into the code generation process itself, with the dataset showing great promise
for machine-learned compiler components. |
---|---|
DOI: | 10.48550/arxiv.2309.15432 |
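
The abstract describes collecting LLVM IR by hooking into each language's package manager or compiler. As a minimal, hypothetical sketch of that kind of per-module extraction (not the authors' actual pipeline), the snippet below asks clang to emit textual LLVM IR for a single C source; it assumes clang is installed and on PATH, and the file names are illustrative.

```python
# Sketch only: emit textual LLVM IR (.ll) for one C module via clang.
# Assumes clang is available on PATH; not the ComPile collection pipeline itself.
import subprocess
import tempfile
from pathlib import Path

C_SOURCE = """
int add(int a, int b) { return a + b; }
"""

def emit_llvm_ir(c_code: str) -> str:
    """Compile a C snippet to textual LLVM IR and return it as a string."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.c"
        out = Path(tmp) / "input.ll"
        src.write_text(c_code)
        # -S -emit-llvm stops the pipeline after IR generation and writes textual IR.
        subprocess.run(
            ["clang", "-O1", "-S", "-emit-llvm", str(src), "-o", str(out)],
            check=True,
        )
        return out.read_text()

if __name__ == "__main__":
    print(emit_llvm_ir(C_SOURCE))
```

Other LLVM-based front ends expose similar hooks for obtaining per-module IR, for example `rustc --emit=llvm-ir` for Rust, `swiftc -emit-ir` for Swift, or Julia's `code_llvm`; these are broadly the kind of code-generation hooks the abstract refers to, though the exact mechanism in the paper may differ.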