Law of the Weakest Link: Cross Capabilities of Large Language Models

The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy.
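A quick way to see the scale this setup implies is a counting sketch. The Python below is illustrative, not code from the paper: the capability names are placeholders, and the paper curates its seven cross capabilities by hand rather than taking the first seven pairs.

```python
# A minimal sketch with hypothetical capability names: pairing individual
# capabilities into cross capabilities and counting CrossEval prompts.
from itertools import combinations

individual = [f"capability_{i}" for i in range(1, 8)]  # 7 placeholder names

# 7 capabilities admit C(7, 2) = 21 possible pairs; the paper manually
# selects 7 common ones. Taking the first 7 is a stand-in for that curation.
all_pairs = list(combinations(individual, 2))
assert len(all_pairs) == 21
cross = all_pairs[:7]

# 100 human-annotated prompts per individual and per cross capability.
PROMPTS_PER_CAPABILITY = 100
total_prompts = (len(individual) + len(cross)) * PROMPTS_PER_CAPABILITY
assert total_prompts == 1_400  # the benchmark's quoted size
```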

Bibliographic Details
Authors: Zhong, Ming; Zhang, Aston; Wang, Xuewei; Hou, Rui; Xiong, Wenhan; Zhu, Chenguang; Chen, Zhengxing; Tan, Liang; Bi, Chloe; Lewis, Mike; Popuri, Sravya; Narang, Sharan; Kambadur, Melanie; Mahajan, Dhruv; Edunov, Sergey; Han, Jiawei; van der Maaten, Laurens
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition
Online access: full text at https://arxiv.org/abs/2409.19951
Abstract: The development and evaluation of Large Language Models (LLMs) have largely focused on individual capabilities. However, this overlooks the intersection of multiple abilities across different types of expertise that are often required for real-world tasks, which we term cross capabilities. To systematically explore this concept, we first define seven core individual capabilities and then pair them to form seven common cross capabilities, each supported by a manually constructed taxonomy. Building on these definitions, we introduce CrossEval, a benchmark comprising 1,400 human-annotated prompts, with 100 prompts for each individual and cross capability. To ensure reliable evaluation, we involve expert annotators to assess 4,200 model responses, gathering 8,400 human ratings with detailed explanations to serve as reference examples. Our findings reveal that, in both static evaluations and attempts to enhance specific abilities, current LLMs consistently exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component. Specifically, across 58 cross-capability scores from 17 models, 38 scores are lower than all individual capabilities, while 20 fall between strong and weak, but closer to the weaker ability. These results highlight the under-performance of LLMs in cross-capability tasks, making the identification and improvement of the weakest capabilities a critical priority for future research to optimize performance in complex, multi-dimensional scenarios.
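The "Law of the Weakest Link" finding lends itself to a small sketch. The Python below is not the paper's evaluation code; it shows one plausible way to classify a cross-capability score against its two component scores, mirroring the reported split (38 of 58 scores below both individual capabilities, 20 in between but closer to the weaker one). The per-prompt response and rating counts in the final check are inferred from the quoted totals, not stated directly in the abstract.

```python
# A minimal sketch, not from the paper: placing a cross-capability score
# relative to its two component capabilities.
from dataclasses import dataclass


@dataclass
class CapabilityPair:
    strong: float  # score on the stronger individual capability
    weak: float    # score on the weaker individual capability
    cross: float   # score on the paired cross capability


def classify(p: CapabilityPair) -> str:
    """Classify the cross score relative to its components."""
    if p.cross < p.weak:
        # Below both components: 38 of the 58 scores in the paper.
        return "below_weakest"
    if p.cross <= p.strong:
        # Between the two: 20 of 58, reported as closer to the weaker one.
        midpoint = (p.strong + p.weak) / 2
        return "between_closer_to_weak" if p.cross < midpoint else "between_closer_to_strong"
    return "above_strongest"


# Example: a cross score dragged toward the weaker component.
print(classify(CapabilityPair(strong=80.0, weak=60.0, cross=63.0)))
# -> between_closer_to_weak

# Counts implied by the abstract's totals: 4,200 responses over 1,400
# prompts and 8,400 ratings over 4,200 responses suggest 3 responses per
# prompt and 2 ratings per response.
assert 4_200 // 1_400 == 3 and 8_400 // 4_200 == 2
```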
DOI: 10.48550/arxiv.2409.19951
Published: 2024-09-30
Rights: CC BY-NC-SA 4.0 (http://creativecommons.org/licenses/by-nc-sa/4.0)
Source: arXiv.org