COMEX: A Tool for Generating Customized Source Code Representations
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Abstract: | Learning effective representations of source code is critical for any Machine
Learning for Software Engineering (ML4SE) system. Inspired by natural language
processing, large language models (LLMs) like Codex and CodeGen treat code as
generic sequences of text and are trained on huge corpora of code data,
achieving state-of-the-art performance on several software engineering (SE)
tasks. However, valid source code, unlike natural language, follows a strict
structure and pattern governed by the underlying grammar of the programming
language. Current LLMs do not exploit this property of the source code as they
treat code like a sequence of tokens and overlook key structural and semantic
properties of code that can be extracted from code-views like the Control Flow
Graph (CFG), Data Flow Graph (DFG), Abstract Syntax Tree (AST), etc.
Unfortunately, the process of generating and integrating code-views for every
programming language is cumbersome and time-consuming. To overcome this
barrier, we propose COMEX, a framework that allows researchers and
developers to create and combine multiple code-views which can be used by
machine learning (ML) models for various SE tasks. Some salient features of our
tool are: (i) it works directly on source code (which need not be compilable),
(ii) it currently supports Java and C#, (iii) it can analyze both method-level
snippets and program-level snippets by using both intra-procedural and
inter-procedural analysis, and (iv) it is easily extendable to other languages
as it is built on tree-sitter, a widely used incremental parser that supports
over 40 languages. We believe this easy-to-use code-view generation and
customization tool will give impetus to research in source code representation
learning methods and ML4SE.
Tool: https://pypi.org/project/comex
GitHub: https://github.com/IBM/tree-sitter-codeviews
Demo: https://youtu.be/GER6U87FVbU |
---|---|
DOI: | 10.48550/arxiv.2307.04693 |
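
The abstract's notion of creating and combining code-views can be illustrated with a minimal sketch. This is not COMEX's API or implementation: COMEX targets Java and C# through tree-sitter, whereas this example swaps in Python's standard-library `ast` module and analyzes Python source so that it runs without extra dependencies. The function name `build_code_views` and the edge formats are hypothetical, and the data-flow pass assumes straight-line code with no branches or loops.

```python
import ast


def build_code_views(source: str):
    """Return (ast_edges, dfg_edges) for a straight-line Python snippet.

    Hypothetical helper for illustration only; not part of COMEX.
    """
    tree = ast.parse(source)

    # AST view: one (parent_type, child_type) edge per parent-child pair.
    ast_edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            ast_edges.append((type(parent).__name__, type(child).__name__))

    # Data-flow view: link each variable read to the line of its latest write.
    # Edge format: (variable, line_of_definition, line_of_use).
    last_def = {}
    dfg_edges = []
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads are resolved before this statement's own writes take effect.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                dfg_edges.append((n.id, last_def[n.id], stmt.lineno))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = stmt.lineno
    return ast_edges, dfg_edges


if __name__ == "__main__":
    snippet = "a = 1\nb = a + 2\na = b * a\n"
    ast_view, dfg_view = build_code_views(snippet)
    print(len(ast_view), "AST edges")
    for edge in dfg_view:
        print("data-flow edge:", edge)
```

Keeping each view as a plain edge list makes combination straightforward: the two lists can be tagged by view name and merged into a single graph, which is the kind of customized representation the abstract describes feeding to ML models.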