ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges
Saved in:
Main authors: | Fu, Rao; Luo, Ziyang; Lin, Hongzhan; Ye, Zhen; Ma, Jing |
---|---|
Format: | Article |
Language: | English |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition |
Online access: | https://arxiv.org/abs/2411.18932 |
creator | Fu, Rao; Luo, Ziyang; Lin, Hongzhan; Ye, Zhen; Ma, Jing |
description | Recent advancements in large multimodal models (LMMs) have showcased
impressive code generation capabilities, primarily evaluated through
image-to-code benchmarks. However, these benchmarks are limited to specific
visual programming scenarios in which logical reasoning and multimodal
understanding are assessed in isolation. To fill this gap, we propose
ScratchEval, a novel benchmark designed to evaluate the visual programming
reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based
visual programming language widely used in children's programming education. By
integrating visual elements with embedded programming logic, ScratchEval
requires a model to process both visual information and code structure,
thereby comprehensively evaluating its ability to understand programming
intent. Our evaluation approach goes beyond traditional image-to-code
mapping and focuses on unified logical thinking and problem-solving abilities,
providing a more comprehensive and challenging framework for assessing the
visual programming ability of LMMs. ScratchEval not only fills a gap in
existing evaluation methods but also offers new insights for the future
development of LMMs in the field of visual programming. Our benchmark can be
accessed at https://github.com/HKBUNLP/ScratchEval. |
doi_str_mv | 10.48550/arxiv.2411.18932 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2411.18932 |
language | eng |
recordid | cdi_arxiv_primary_2411_18932 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition |
title | ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges |
url | https://arxiv.org/abs/2411.18932 |
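
The abstract above describes evaluating LMMs on images of Scratch block programs, but this record carries no implementation details. Purely as a hypothetical illustration, the sketch below shows how one might pose a ScratchEval-style question (a screenshot of a Scratch program plus a multiple-choice prompt) to GPT-4o via the OpenAI Python SDK. The file name `scratch_task.png`, the question text, and the answer options are all invented for this example; the actual task format is defined in the repository at https://github.com/HKBUNLP/ScratchEval.

```python
# Hypothetical ScratchEval-style query. The task image, question, and
# answer options are illustrative only, not taken from the benchmark.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a screenshot of a Scratch block program as base64 (hypothetical file).
with open("scratch_task.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

question = (
    "The image shows a Scratch program. After the green flag is clicked, "
    "where does the sprite end up?\n"
    "A) x=0, y=0   B) x=100, y=0   C) x=0, y=100   D) x=100, y=100\n"
    "Answer with a single letter."
)

# Send the question and the program image together in one multimodal message.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }],
)
print(response.choices[0].message.content)
```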