MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains

Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underly...

Bibliographic Details
Published in:	arXiv.org 2024-08
Main authors:	Yin, Guoli, Bai, Haoping, Ma, Shuang, Feng, Nan, Sun, Yanchao, Xu, Zhaoyang, Shen, Ma, Lu, Jiarui, Kong, Xiang, Zhang, Aonan, Dian Ang Yap, zhang, Yizhe, Ahnert, Karsten, Kamath, Vik, Berglund, Mathias, Walsh, Dominic, Gindele, Tobias, Wiest, Juergen, Lai, Zhengfeng, Wang, Xiaoming, Jiulong Shan, Cao, Meng, Pang, Ruoming, Wang, Zirui
Format:	Article
Language:	eng
Subjects:	Benchmarks Environment models Large language models Machine learning Performance evaluation Reagents Task complexity
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Yin, Guoli Bai, Haoping Ma, Shuang Feng, Nan Sun, Yanchao Xu, Zhaoyang Shen, Ma Lu, Jiarui Kong, Xiang Zhang, Aonan Dian Ang Yap zhang, Yizhe Ahnert, Karsten Kamath, Vik Berglund, Mathias Walsh, Dominic Gindele, Tobias Wiest, Juergen Lai, Zhengfeng Wang, Xiaoming Jiulong Shan Cao, Meng Pang, Ruoming Wang, Zirui
description	Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/tree/main/docs/research/mmau.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3086453874</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3086453874</sourcerecordid><originalsourceid>FETCH-proquest_journals_30864538743</originalsourceid><addsrcrecordid>eNqNyr0OgjAUQOHGxESivMNNnElqy1_cKqgsbDqTSooWocXe4vPr4AM4neE7CxIwzndRHjO2IiFiTyllacaShAfkXNfiugcBlR00et3CQZn2MUr3BNuBuCvjoZCTvOlBe60QROssIpT6rRwqKO0otcENWXZyQBX-uibb0_FSVNHk7GtW6Jvezs58qeE0T-OE51nM_7s-MRw6Vw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3086453874</pqid></control><display><type>article</type><title>MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains</title><source>Free E- Journals</source><creator>Yin, Guoli ; Bai, Haoping ; Ma, Shuang ; Feng, Nan ; Sun, Yanchao ; Xu, Zhaoyang ; Shen, Ma ; Lu, Jiarui ; Kong, Xiang ; Zhang, Aonan ; Dian Ang Yap ; zhang, Yizhe ; Ahnert, Karsten ; Kamath, Vik ; Berglund, Mathias ; Walsh, Dominic ; Gindele, Tobias ; Wiest, Juergen ; Lai, Zhengfeng ; Wang, Xiaoming ; Jiulong Shan ; Cao, Meng ; Pang, Ruoming ; Wang, Zirui</creator><creatorcontrib>Yin, Guoli ; Bai, Haoping ; Ma, Shuang ; Feng, Nan ; Sun, Yanchao ; Xu, Zhaoyang ; Shen, Ma ; Lu, Jiarui ; Kong, Xiang ; Zhang, Aonan ; Dian Ang Yap ; zhang, Yizhe ; Ahnert, Karsten ; Kamath, Vik ; Berglund, Mathias ; Walsh, Dominic ; Gindele, Tobias ; Wiest, Juergen ; Lai, Zhengfeng ; Wang, Xiaoming ; Jiulong Shan ; Cao, Meng ; Pang, Ruoming ; Wang, Zirui</creatorcontrib><description>Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. 
This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/tree/main/docs/research/mmau.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Benchmarks ; Environment models ; Large language models ; Machine learning ; Performance evaluation ; Reagents ; Task complexity</subject><ispartof>arXiv.org, 2024-08</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by-sa/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Yin, Guoli</creatorcontrib><creatorcontrib>Bai, Haoping</creatorcontrib><creatorcontrib>Ma, Shuang</creatorcontrib><creatorcontrib>Feng, Nan</creatorcontrib><creatorcontrib>Sun, Yanchao</creatorcontrib><creatorcontrib>Xu, Zhaoyang</creatorcontrib><creatorcontrib>Shen, Ma</creatorcontrib><creatorcontrib>Lu, Jiarui</creatorcontrib><creatorcontrib>Kong, Xiang</creatorcontrib><creatorcontrib>Zhang, Aonan</creatorcontrib><creatorcontrib>Dian Ang Yap</creatorcontrib><creatorcontrib>zhang, Yizhe</creatorcontrib><creatorcontrib>Ahnert, Karsten</creatorcontrib><creatorcontrib>Kamath, Vik</creatorcontrib><creatorcontrib>Berglund, Mathias</creatorcontrib><creatorcontrib>Walsh, Dominic</creatorcontrib><creatorcontrib>Gindele, Tobias</creatorcontrib><creatorcontrib>Wiest, Juergen</creatorcontrib><creatorcontrib>Lai, Zhengfeng</creatorcontrib><creatorcontrib>Wang, Xiaoming</creatorcontrib><creatorcontrib>Jiulong Shan</creatorcontrib><creatorcontrib>Cao, Meng</creatorcontrib><creatorcontrib>Pang, Ruoming</creatorcontrib><creatorcontrib>Wang, Zirui</creatorcontrib><title>MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains</title><title>arXiv.org</title><description>Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. 
This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. 
Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/tree/main/docs/research/mmau.</description><subject>Benchmarks</subject><subject>Environment models</subject><subject>Large language models</subject><subject>Machine learning</subject><subject>Performance evaluation</subject><subject>Reagents</subject><subject>Task complexity</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNyr0OgjAUQOHGxESivMNNnElqy1_cKqgsbDqTSooWocXe4vPr4AM4neE7CxIwzndRHjO2IiFiTyllacaShAfkXNfiugcBlR00et3CQZn2MUr3BNuBuCvjoZCTvOlBe60QROssIpT6rRwqKO0otcENWXZyQBX-uibb0_FSVNHk7GtW6Jvezs58qeE0T-OE51nM_7s-MRw6Vw</recordid><startdate>20240815</startdate><enddate>20240815</enddate><creator>Yin, Guoli</creator><creator>Bai, Haoping</creator><creator>Ma, Shuang</creator><creator>Feng, Nan</creator><creator>Sun, Yanchao</creator><creator>Xu, Zhaoyang</creator><creator>Shen, Ma</creator><creator>Lu, Jiarui</creator><creator>Kong, Xiang</creator><creator>Zhang, Aonan</creator><creator>Dian Ang Yap</creator><creator>zhang, Yizhe</creator><creator>Ahnert, Karsten</creator><creator>Kamath, Vik</creator><creator>Berglund, Mathias</creator><creator>Walsh, Dominic</creator><creator>Gindele, Tobias</creator><creator>Wiest, Juergen</creator><creator>Lai, Zhengfeng</creator><creator>Wang, Xiaoming</creator><creator>Jiulong Shan</creator><creator>Cao, Meng</creator><creator>Pang, Ruoming</creator><creator>Wang, Zirui</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240815</creationdate><title>MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains</title><author>Yin, Guoli ; Bai, Haoping ; Ma, Shuang ; Feng, Nan ; Sun, Yanchao ; Xu, Zhaoyang ; Shen, Ma ; Lu, Jiarui ; Kong, Xiang ; Zhang, Aonan ; Dian Ang Yap ; zhang, Yizhe ; Ahnert, Karsten ; Kamath, Vik ; Berglund, Mathias ; Walsh, Dominic ; Gindele, Tobias ; Wiest, Juergen ; Lai, Zhengfeng ; Wang, Xiaoming ; Jiulong Shan ; Cao, Meng ; Pang, Ruoming ; Wang, Zirui</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30864538743</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Benchmarks</topic><topic>Environment models</topic><topic>Large language models</topic><topic>Machine learning</topic><topic>Performance evaluation</topic><topic>Reagents</topic><topic>Task complexity</topic><toplevel>online_resources</toplevel><creatorcontrib>Yin, Guoli</creatorcontrib><creatorcontrib>Bai, Haoping</creatorcontrib><creatorcontrib>Ma, Shuang</creatorcontrib><creatorcontrib>Feng, Nan</creatorcontrib><creatorcontrib>Sun, Yanchao</creatorcontrib><creatorcontrib>Xu, Zhaoyang</creatorcontrib><creatorcontrib>Shen, Ma</creatorcontrib><creatorcontrib>Lu, Jiarui</creatorcontrib><creatorcontrib>Kong, Xiang</creatorcontrib><creatorcontrib>Zhang, Aonan</creatorcontrib><creatorcontrib>Dian Ang Yap</creatorcontrib><creatorcontrib>zhang, Yizhe</creatorcontrib><creatorcontrib>Ahnert, Karsten</creatorcontrib><creatorcontrib>Kamath, 
Vik</creatorcontrib><creatorcontrib>Berglund, Mathias</creatorcontrib><creatorcontrib>Walsh, Dominic</creatorcontrib><creatorcontrib>Gindele, Tobias</creatorcontrib><creatorcontrib>Wiest, Juergen</creatorcontrib><creatorcontrib>Lai, Zhengfeng</creatorcontrib><creatorcontrib>Wang, Xiaoming</creatorcontrib><creatorcontrib>Jiulong Shan</creatorcontrib><creatorcontrib>Cao, Meng</creatorcontrib><creatorcontrib>Pang, Ruoming</creatorcontrib><creatorcontrib>Wang, Zirui</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Yin, Guoli</au><au>Bai, Haoping</au><au>Ma, Shuang</au><au>Feng, Nan</au><au>Sun, Yanchao</au><au>Xu, Zhaoyang</au><au>Shen, Ma</au><au>Lu, Jiarui</au><au>Kong, Xiang</au><au>Zhang, Aonan</au><au>Dian Ang Yap</au><au>zhang, Yizhe</au><au>Ahnert, Karsten</au><au>Kamath, Vik</au><au>Berglund, Mathias</au><au>Walsh, Dominic</au><au>Gindele, Tobias</au><au>Wiest, 
Juergen</au><au>Lai, Zhengfeng</au><au>Wang, Xiaoming</au><au>Jiulong Shan</au><au>Cao, Meng</au><au>Pang, Ruoming</au><au>Wang, Zirui</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains</atitle><jtitle>arXiv.org</jtitle><date>2024-08-15</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. 
Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/tree/main/docs/research/mmau.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-08
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3086453874
source	Free E- Journals
subjects	Benchmarks Environment models Large language models Machine learning Performance evaluation Reagents Task complexity
title	MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T01%3A42%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=MMAU:%20A%20Holistic%20Benchmark%20of%20Agent%20Capabilities%20Across%20Diverse%20Domains&rft.jtitle=arXiv.org&rft.au=Yin,%20Guoli&rft.date=2024-08-15&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3086453874%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3086453874&rft_id=info:pmid/&rfr_iscdi=true