MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems
| Main authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Online access: | Order full text |
Abstract:

Programming often involves converting detailed and complex specifications into code, a process during which developers typically use visual aids to convey concepts more effectively. While recent developments in Large Multimodal Models have demonstrated remarkable abilities in visual reasoning and mathematical tasks, there has been little work investigating whether these models can effectively interpret visual elements for code generation. To this end, we present MMCode, the first multimodal coding dataset for evaluating algorithmic problem-solving skills in visually rich contexts. MMCode contains 3,548 questions and 6,620 images collected from real-world programming challenges harvested from 10 code competition websites, and it poses significant challenges due to its heavy demand on reasoning abilities. Our experimental results show that current state-of-the-art models struggle to solve these problems. The results highlight the lack of powerful vision-code models, and we hope MMCode can serve as inspiration for future work in this domain. The data and code are publicly available at https://github.com/likaixin2000/MMCode.
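As a rough illustration of the evaluation setting the abstract describes, the sketch below sends a problem statement together with its images to a vision-capable chat model and extracts a candidate code solution. The problem dictionary fields (`question`, `images`) are hypothetical placeholders, not MMCode's actual schema, and the test harness is omitted; consult the linked repository for the real data format and evaluation pipeline. Only the OpenAI client usage reflects a real API.

```python
"""Minimal sketch: query a multimodal model on an MMCode-style problem.

Assumptions (not from the paper or repository): problems are dicts with a
``question`` string and a list of local image ``images`` paths.
"""
import base64
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Base64-encode a local image so it can be sent inline as a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def solve(problem: dict) -> str:
    """Ask a vision-capable model to write code for one problem."""
    content = [{"type": "text", "text": problem["question"]}]
    for path in problem["images"]:  # hypothetical field: image file paths
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    reply = response.choices[0].message.content
    # Extract the first fenced Python block if present, else return raw text.
    match = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply
```

A full benchmark run would then execute each returned solution against the problem's hidden test cases and report pass rates, which is the kind of functional-correctness scoring typical for code-generation benchmarks.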
DOI: 10.48550/arxiv.2404.09486