MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
creator | Chen, Jiacheng; Liang, Tianhao; Siu, Sherman; Wang, Zhengqing; Wang, Kai; Wang, Yubo; Ni, Yuansheng; Zhu, Wang; Jiang, Ziyan; Lyu, Bohan; Jiang, Dongfu; He, Xuan; Liu, Yuan; Hu, Hexiang; Yue, Xiang; Chen, Wenhu |
description | We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, addressing the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multiple-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats, such as numbers, phrases, code, LaTeX, coordinates, JSON, and free-form text. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions. |
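For intuition on how heterogeneous output formats can drive metric selection, here is a minimal Python sketch of a format-keyed metric dispatch. All names (`score_number`, `METRICS`, `evaluate`) and the format keys are illustrative assumptions for exposition, not MEGA-Bench's actual implementation or API.

```python
# Hypothetical sketch of per-output-format scoring, in the spirit of
# MEGA-Bench's format-aware metrics; names and formats are assumptions.
import json
import math

def score_number(pred: str, ref: str, rel_tol: float = 1e-3) -> float:
    """Numeric answers: compare within a relative tolerance."""
    try:
        return float(math.isclose(float(pred), float(ref), rel_tol=rel_tol))
    except ValueError:
        return 0.0

def score_json(pred: str, ref: str) -> float:
    """Structured answers: parse both sides and compare the parsed objects,
    so whitespace and key ordering differences do not count as errors."""
    try:
        return float(json.loads(pred) == json.loads(ref))
    except json.JSONDecodeError:
        return 0.0

def score_exact(pred: str, ref: str) -> float:
    """Phrases and short strings: case-insensitive exact match."""
    return float(pred.strip().lower() == ref.strip().lower())

# Dispatch table: each task declares its output format, which selects a metric.
METRICS = {"number": score_number, "json": score_json, "phrase": score_exact}

def evaluate(task_format: str, pred: str, ref: str) -> float:
    return METRICS.get(task_format, score_exact)(pred, ref)

# Example: numeric tasks tolerate small rounding differences, and JSON tasks
# ignore formatting differences in otherwise identical objects.
assert evaluate("number", "3.1416", "3.14159") == 1.0
assert evaluate("json", '{"x": 1}', '{ "x" : 1 }') == 1.0
```

The design point this illustrates is that a single benchmark-wide string match would penalize correct answers in many of these formats, which is why a per-format metric registry of this kind is needed once outputs go beyond multiple choice.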
doi_str_mv | 10.48550/arxiv.2410.10563 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2410_10563</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2410_10563</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2410_105633</originalsourceid><addsrcrecordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMgEKGBqYmhlzMrj5uro76jql5iVnWCkEJyfmZOalK_iW5pRk5uanJOYouJYl5pQmlmTm5ymU5Cvkl6UWKZgaGCgEpSbm6IbnF-WkKIQkFmcX8zCwpiXmFKfyQmluBnk31xBnD12wjfEFRZm5iUWV8SCb48E2GxNWAQCEzzcB</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks</title><source>arXiv.org</source><creator>Chen, Jiacheng ; Liang, Tianhao ; Siu, Sherman ; Wang, Zhengqing ; Wang, Kai ; Wang, Yubo ; Ni, Yuansheng ; Zhu, Wang ; Jiang, Ziyan ; Lyu, Bohan ; Jiang, Dongfu ; He, Xuan ; Liu, Yuan ; Hu, Hexiang ; Yue, Xiang ; Chen, Wenhu</creator><creatorcontrib>Chen, Jiacheng ; Liang, Tianhao ; Siu, Sherman ; Wang, Zhengqing ; Wang, Kai ; Wang, Yubo ; Ni, Yuansheng ; Zhu, Wang ; Jiang, Ziyan ; Lyu, Bohan ; Jiang, Dongfu ; He, Xuan ; Liu, Yuan ; Hu, Hexiang ; Yue, Xiang ; Chen, Wenhu</creatorcontrib><description>We present MEGA-Bench, an evaluation suite that scales multimodal evaluation
to over 500 real-world tasks, to address the highly heterogeneous daily use
cases of end users. Our objective is to optimize for a set of high-quality data
samples that cover a highly diverse and rich set of multimodal tasks, while
enabling cost-effective and accurate model evaluation. In particular, we
collected 505 realistic tasks encompassing over 8,000 samples from 16 expert
annotators to extensively cover the multimodal task space. Instead of unifying
these problems into standard multi-choice questions (like MMMU, MMBench, and
MMT-Bench), we embrace a wide range of output formats like numbers, phrases,
code, \LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats,
we developed over 40 metrics to evaluate these tasks. Unlike existing
benchmarks, MEGA-Bench offers a fine-grained capability report across multiple
dimensions (e.g., application, input type, output format, skill), allowing
users to interact with and visualize model capabilities in depth. We evaluate a
wide variety of frontier vision-language models on MEGA-Bench to understand
their capabilities across these dimensions.</description><identifier>DOI: 10.48550/arxiv.2410.10563</identifier><language>eng</language><subject>Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2024-10</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2410.10563$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2410.10563$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Chen, Jiacheng</creatorcontrib><creatorcontrib>Liang, Tianhao</creatorcontrib><creatorcontrib>Siu, Sherman</creatorcontrib><creatorcontrib>Wang, Zhengqing</creatorcontrib><creatorcontrib>Wang, Kai</creatorcontrib><creatorcontrib>Wang, Yubo</creatorcontrib><creatorcontrib>Ni, Yuansheng</creatorcontrib><creatorcontrib>Zhu, Wang</creatorcontrib><creatorcontrib>Jiang, Ziyan</creatorcontrib><creatorcontrib>Lyu, Bohan</creatorcontrib><creatorcontrib>Jiang, Dongfu</creatorcontrib><creatorcontrib>He, Xuan</creatorcontrib><creatorcontrib>Liu, Yuan</creatorcontrib><creatorcontrib>Hu, Hexiang</creatorcontrib><creatorcontrib>Yue, Xiang</creatorcontrib><creatorcontrib>Chen, Wenhu</creatorcontrib><title>MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks</title><description>We present MEGA-Bench, an evaluation suite that scales multimodal evaluation
to over 500 real-world tasks, to address the highly heterogeneous daily use
cases of end users. Our objective is to optimize for a set of high-quality data
samples that cover a highly diverse and rich set of multimodal tasks, while
enabling cost-effective and accurate model evaluation. In particular, we
collected 505 realistic tasks encompassing over 8,000 samples from 16 expert
annotators to extensively cover the multimodal task space. Instead of unifying
these problems into standard multi-choice questions (like MMMU, MMBench, and
MMT-Bench), we embrace a wide range of output formats like numbers, phrases,
code, \LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats,
we developed over 40 metrics to evaluate these tasks. Unlike existing
benchmarks, MEGA-Bench offers a fine-grained capability report across multiple
dimensions (e.g., application, input type, output format, skill), allowing
users to interact with and visualize model capabilities in depth. We evaluate a
wide variety of frontier vision-language models on MEGA-Bench to understand
their capabilities across these dimensions.</description><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMgEKGBqYmhlzMrj5uro76jql5iVnWCkEJyfmZOalK_iW5pRk5uanJOYouJYl5pQmlmTm5ymU5Cvkl6UWKZgaGCgEpSbm6IbnF-WkKIQkFmcX8zCwpiXmFKfyQmluBnk31xBnD12wjfEFRZm5iUWV8SCb48E2GxNWAQCEzzcB</recordid><startdate>20241014</startdate><enddate>20241014</enddate><creator>Chen, Jiacheng</creator><creator>Liang, Tianhao</creator><creator>Siu, Sherman</creator><creator>Wang, Zhengqing</creator><creator>Wang, Kai</creator><creator>Wang, Yubo</creator><creator>Ni, Yuansheng</creator><creator>Zhu, Wang</creator><creator>Jiang, Ziyan</creator><creator>Lyu, Bohan</creator><creator>Jiang, Dongfu</creator><creator>He, Xuan</creator><creator>Liu, Yuan</creator><creator>Hu, Hexiang</creator><creator>Yue, Xiang</creator><creator>Chen, Wenhu</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20241014</creationdate><title>MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks</title><author>Chen, Jiacheng ; Liang, Tianhao ; Siu, Sherman ; Wang, Zhengqing ; Wang, Kai ; Wang, Yubo ; Ni, Yuansheng ; Zhu, Wang ; Jiang, Ziyan ; Lyu, Bohan ; Jiang, Dongfu ; He, Xuan ; Liu, Yuan ; Hu, Hexiang ; Yue, Xiang ; Chen, Wenhu</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2410_105633</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Chen, Jiacheng</creatorcontrib><creatorcontrib>Liang, Tianhao</creatorcontrib><creatorcontrib>Siu, Sherman</creatorcontrib><creatorcontrib>Wang, Zhengqing</creatorcontrib><creatorcontrib>Wang, Kai</creatorcontrib><creatorcontrib>Wang, Yubo</creatorcontrib><creatorcontrib>Ni, Yuansheng</creatorcontrib><creatorcontrib>Zhu, Wang</creatorcontrib><creatorcontrib>Jiang, Ziyan</creatorcontrib><creatorcontrib>Lyu, Bohan</creatorcontrib><creatorcontrib>Jiang, Dongfu</creatorcontrib><creatorcontrib>He, Xuan</creatorcontrib><creatorcontrib>Liu, Yuan</creatorcontrib><creatorcontrib>Hu, Hexiang</creatorcontrib><creatorcontrib>Yue, Xiang</creatorcontrib><creatorcontrib>Chen, Wenhu</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Chen, Jiacheng</au><au>Liang, Tianhao</au><au>Siu, Sherman</au><au>Wang, Zhengqing</au><au>Wang, Kai</au><au>Wang, Yubo</au><au>Ni, Yuansheng</au><au>Zhu, Wang</au><au>Jiang, Ziyan</au><au>Lyu, Bohan</au><au>Jiang, Dongfu</au><au>He, Xuan</au><au>Liu, Yuan</au><au>Hu, Hexiang</au><au>Yue, Xiang</au><au>Chen, Wenhu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks</atitle><date>2024-10-14</date><risdate>2024</risdate><abstract>We present MEGA-Bench, an evaluation suite that scales multimodal evaluation
to over 500 real-world tasks, to address the highly heterogeneous daily use
cases of end users. Our objective is to optimize for a set of high-quality data
samples that cover a highly diverse and rich set of multimodal tasks, while
enabling cost-effective and accurate model evaluation. In particular, we
collected 505 realistic tasks encompassing over 8,000 samples from 16 expert
annotators to extensively cover the multimodal task space. Instead of unifying
these problems into standard multi-choice questions (like MMMU, MMBench, and
MMT-Bench), we embrace a wide range of output formats like numbers, phrases,
code, \LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats,
we developed over 40 metrics to evaluate these tasks. Unlike existing
benchmarks, MEGA-Bench offers a fine-grained capability report across multiple
dimensions (e.g., application, input type, output format, skill), allowing
users to interact with and visualize model capabilities in depth. We evaluate a
wide variety of frontier vision-language models on MEGA-Bench to understand
their capabilities across these dimensions.</abstract><doi>10.48550/arxiv.2410.10563</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2410.10563 |
language | eng |
recordid | cdi_arxiv_primary_2410_10563 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-22T06%3A53%3A26IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MEGA-Bench:%20Scaling%20Multimodal%20Evaluation%20to%20over%20500%20Real-World%20Tasks&rft.au=Chen,%20Jiacheng&rft.date=2024-10-14&rft_id=info:doi/10.48550/arxiv.2410.10563&rft_dat=%3Carxiv_GOX%3E2410_10563%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |