TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench's challenges.


Bibliographic Details
Main Authors: Wang, Haochuan; Feng, Xiachong; Li, Lei; Qin, Zhanyue; Sui, Dianbo; Kong, Lingpeng
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Wang, Haochuan; Feng, Xiachong; Li, Lei; Qin, Zhanyue; Sui, Dianbo; Kong, Lingpeng
description The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench's challenges.
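As a reading aid for the abstract above, the following is a minimal sketch, not taken from the paper, of how one atomic 2x2 game might be represented and checked for a pure-strategy Nash equilibrium, and how such atomic units could be grouped into the sequential, parallel, and nested forms the benchmark describes; all class and variable names (AtomicGame, pure_nash_equilibria, etc.) are hypothetical illustrations.

from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

@dataclass
class AtomicGame:
    name: str
    # payoffs[row][col] = (row player's ordinal payoff, column player's ordinal payoff),
    # with each player ranking the four outcomes 1 (worst) to 4 (best).
    payoffs: Tuple[Tuple[Tuple[int, int], Tuple[int, int]],
                   Tuple[Tuple[int, int], Tuple[int, int]]]

    def pure_nash_equilibria(self) -> List[Tuple[int, int]]:
        """Return the (row, col) action pairs from which neither player gains by deviating."""
        equilibria = []
        for r, c in product(range(2), repeat=2):
            row_payoff, col_payoff = self.payoffs[r][c]
            row_best = row_payoff >= self.payoffs[1 - r][c][0]  # row player cannot improve by switching rows
            col_best = col_payoff >= self.payoffs[r][1 - c][1]  # column player cannot improve by switching columns
            if row_best and col_best:
                equilibria.append((r, c))
        return equilibria

# Classic Prisoner's Dilemma in ordinal form: action 0 = cooperate, action 1 = defect.
pd = AtomicGame(
    name="prisoners_dilemma",
    payoffs=(((3, 3), (1, 4)),
             ((4, 1), (2, 2))),
)
print(pd.pure_nash_equilibria())  # [(1, 1)] -> mutual defection

# Atomic games could then be organized into the more complex evaluation forms
# the abstract mentions, represented here simply as ordered collections.
sequential_form = [pd, pd]      # games posed one after another
parallel_form = {pd.name: pd}   # games posed together in a single query
nested_form = (pd, [pd])        # an outer game whose outcome frames inner games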
doi_str_mv 10.48550/arxiv.2410.10479
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2410.10479
language eng
recordid cdi_arxiv_primary_2410_10479
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Science and Game Theory
title TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
url https://arxiv.org/abs/2410.10479
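The figure of 144 game types in the abstract corresponds to the Robinson-Goforth count of strict ordinal 2x2 games: the 4! x 4! = 576 ways to rank the four outcomes for each player, taken up to relabeling of either player's two strategies. A rough, purely illustrative sketch of that count (not code from the paper):

from itertools import permutations

def canonical(payoffs):
    """Canonical representative of a game under swapping either player's two strategies."""
    def swap_rows(p):
        return (p[1], p[0])
    def swap_cols(p):
        return ((p[0][1], p[0][0]), (p[1][1], p[1][0]))
    return min((payoffs,
                swap_rows(payoffs),
                swap_cols(payoffs),
                swap_rows(swap_cols(payoffs))))

distinct = set()
for row_ranks in permutations(range(1, 5)):        # row player's ranking of the four cells
    for col_ranks in permutations(range(1, 5)):    # column player's ranking of the four cells
        payoffs = (((row_ranks[0], col_ranks[0]), (row_ranks[1], col_ranks[1])),
                   ((row_ranks[2], col_ranks[2]), (row_ranks[3], col_ranks[3])))
        distinct.add(canonical(payoffs))

print(len(distinct))  # 144 distinct strict ordinal 2x2 game types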