ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
Format: Article
Language: English
Abstract: Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios in which logical reasoning and multimodal understanding are assessed separately. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education. By integrating visual elements and embedded programming logic, ScratchEval requires a model to process both visual information and code structure, thereby comprehensively evaluating its ability to understand programming intent. Our evaluation approach goes beyond traditional image-to-code mapping and focuses on unified logical thinking and problem-solving abilities, providing a more comprehensive and challenging framework for evaluating the visual programming ability of LMMs. ScratchEval not only fills a gap in existing evaluation methods but also offers new insights for the future development of LMMs in the field of visual programming. Our benchmark can be accessed at https://github.com/HKBUNLP/ScratchEval.
DOI: 10.48550/arxiv.2411.18932
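
The abstract describes ScratchEval as posing questions about Scratch block programs that a model must answer from the rendered image. Below is a minimal, purely illustrative sketch of how such a multiple-choice visual evaluation could be driven; the ScratchItem schema, the query_model stub, and the prompt wording are assumptions made for illustration, not the benchmark's actual interface (see the GitHub repository above for the real data format).

```python
# Hypothetical harness for a ScratchEval-style multiple-choice evaluation.
# Dataset schema, prompt format, and the model stub are illustrative
# assumptions, not the authors' actual tooling.
import base64
from dataclasses import dataclass

@dataclass
class ScratchItem:
    image_path: str     # screenshot of a Scratch block program
    question: str       # question about the program's behavior
    choices: list[str]  # candidate answers
    answer: str         # gold choice label, e.g. "B"

def query_model(image_b64: str, prompt: str) -> str:
    """Placeholder for an LMM call (e.g. GPT-4o via an API client).

    A real implementation would send the base64-encoded image together
    with the prompt and return the model's textual reply.
    """
    raise NotImplementedError

def evaluate(items: list[ScratchItem]) -> float:
    """Return accuracy over the benchmark items."""
    correct = 0
    for item in items:
        with open(item.image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        # Label the choices A, B, C, ... and ask for the letter only.
        options = "\n".join(
            f"{chr(65 + i)}. {c}" for i, c in enumerate(item.choices)
        )
        prompt = (
            f"{item.question}\n{options}\n"
            "Answer with the letter of the correct option only."
        )
        reply = query_model(image_b64, prompt)
        if reply.strip().upper().startswith(item.answer):
            correct += 1
    return correct / len(items)
```

Scoring on the leading choice letter keeps the comparison robust to models that answer verbosely rather than with a bare label.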