TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench's challenges.


Bibliographic Details
Main Authors: Wang, Haochuan; Feng, Xiachong; Li, Lei; Qin, Zhanyue; Sui, Dianbo; Kong, Lingpeng
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Wang, Haochuan; Feng, Xiachong; Li, Lei; Qin, Zhanyue; Sui, Dianbo; Kong, Lingpeng
description The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy, consistency, and varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games, highlighting TMGBench's challenges.
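As a reading aid for the abstract above, the following is a minimal sketch, not taken from the paper, of how one atomic 2x2 game might be represented and checked for a pure-strategy Nash equilibrium, and how such atomic units could be grouped into the sequential, parallel, and nested forms the benchmark describes; all class and variable names (AtomicGame, pure_nash_equilibria, etc.) are hypothetical illustrations.

from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

@dataclass
class AtomicGame:
    name: str
    # payoffs[row][col] = (row player's ordinal payoff, column player's ordinal payoff),
    # with each player ranking the four outcomes 1 (worst) to 4 (best).
    payoffs: Tuple[Tuple[Tuple[int, int], Tuple[int, int]],
                   Tuple[Tuple[int, int], Tuple[int, int]]]

    def pure_nash_equilibria(self) -> List[Tuple[int, int]]:
        """Return the (row, col) action pairs from which neither player gains by deviating."""
        equilibria = []
        for r, c in product(range(2), repeat=2):
            row_payoff, col_payoff = self.payoffs[r][c]
            row_best = row_payoff >= self.payoffs[1 - r][c][0]  # row player cannot improve by switching rows
            col_best = col_payoff >= self.payoffs[r][1 - c][1]  # column player cannot improve by switching columns
            if row_best and col_best:
                equilibria.append((r, c))
        return equilibria

# Classic Prisoner's Dilemma in ordinal form: action 0 = cooperate, action 1 = defect.
pd = AtomicGame(
    name="prisoners_dilemma",
    payoffs=(((3, 3), (1, 4)),
             ((4, 1), (2, 2))),
)
print(pd.pure_nash_equilibria())  # [(1, 1)] -> mutual defection

# Atomic games could then be organized into the more complex evaluation forms
# the abstract mentions, represented here simply as ordered collections.
sequential_form = [pd, pd]      # games posed one after another
parallel_form = {pd.name: pd}   # games posed together in a single query
nested_form = (pd, [pd])        # an outer game whose outcome frames inner games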
doi_str_mv 10.48550/arxiv.2410.10479
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2410.10479
language eng
recordid cdi_arxiv_primary_2410_10479
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Science and Game Theory
title TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
url https://arxiv.org/abs/2410.10479
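The figure of 144 game types in the abstract corresponds to the Robinson-Goforth count of strict ordinal 2x2 games: the 4! x 4! = 576 ways to rank the four outcomes for each player, taken up to relabeling of either player's two strategies. A rough, purely illustrative sketch of that count (not code from the paper):

from itertools import permutations

def canonical(payoffs):
    """Canonical representative of a game under swapping either player's two strategies."""
    def swap_rows(p):
        return (p[1], p[0])
    def swap_cols(p):
        return ((p[0][1], p[0][0]), (p[1][1], p[1][0]))
    return min((payoffs,
                swap_rows(payoffs),
                swap_cols(payoffs),
                swap_rows(swap_cols(payoffs))))

distinct = set()
for row_ranks in permutations(range(1, 5)):        # row player's ranking of the four cells
    for col_ranks in permutations(range(1, 5)):    # column player's ranking of the four cells
        payoffs = (((row_ranks[0], col_ranks[0]), (row_ranks[1], col_ranks[1])),
                   ((row_ranks[2], col_ranks[2]), (row_ranks[3], col_ranks[3])))
        distinct.add(canonical(payoffs))

print(len(distinct))  # 144 distinct strict ordinal 2x2 game types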