TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
| Main authors: | |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Order full text |
Abstract: The rapid advancement of large language models (LLMs) has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate LLMs' strategic reasoning capabilities, game theory, with its concise structure, has become a preferred approach. However, current research focuses on a limited selection of games, resulting in low coverage. Classic game scenarios risk data leakage, and existing benchmarks often lack extensibility, making them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, a benchmark with comprehensive game-type coverage, novel scenarios, and flexible organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games. We also employ synthetic data generation to create diverse, higher-quality scenarios through topic guidance and human inspection, referred to as story-based games. Lastly, we provide a sustainable framework for increasingly powerful LLMs by treating these games as atomic units and organizing them into more complex forms via sequential, parallel, and nested structures. Our comprehensive evaluation of mainstream LLMs covers tests on rational reasoning, robustness, Theory-of-Mind (ToM), and reasoning in complex forms. Results reveal flaws in accuracy and consistency, as well as varying mastery of ToM. Additionally, o1-mini, OpenAI's latest reasoning model, achieved accuracy rates of 66.6%, 60.0%, and 70.0% on sequential, parallel, and nested games respectively, highlighting TMGBench's challenges.
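To make the benchmark's building blocks concrete, here is a minimal, hypothetical Python sketch; the abstract does not specify TMGBench's data format, so the class and field names below are illustrative assumptions. It represents one atomic 2x2 game as a payoff matrix (the unit the Robinson-Goforth topology classifies into 144 ordinal types), finds its pure-strategy Nash equilibria, and notes in comments how atomic games might be chained into sequential, parallel, and nested forms:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AtomicGame:
    """A 2x2 game: each player picks one of two actions; payoffs[r][c]
    holds the (row_payoff, col_payoff) pair for that action profile."""
    name: str
    payoffs: List[List[Tuple[int, int]]]  # 2x2 matrix of payoff pairs

    def pure_nash_equilibria(self) -> List[Tuple[int, int]]:
        """Return (row_action, col_action) profiles where neither
        player can gain by unilaterally switching actions."""
        eqs = []
        for r in range(2):
            for c in range(2):
                row_ok = self.payoffs[r][c][0] >= self.payoffs[1 - r][c][0]
                col_ok = self.payoffs[r][c][1] >= self.payoffs[r][1 - c][1]
                if row_ok and col_ok:
                    eqs.append((r, c))
        return eqs

# Prisoner's Dilemma, one ordinal type among the 144 in the topology
# (action 0 = cooperate, action 1 = defect).
pd = AtomicGame(
    name="prisoners_dilemma",
    payoffs=[[(3, 3), (1, 4)],
             [(4, 1), (2, 2)]],
)
print(pd.pure_nash_equilibria())  # [(1, 1)] -- mutual defection

# Composite forms (assumed structure) treat such games as atomic units:
sequential = [pd, pd]                 # games answered one after another
parallel = (pd, pd)                   # games posed together in one prompt
nested = {"outer": pd, "inner": pd}   # one game embedded in another's context
```

The equilibrium check simply tests each of the four action profiles against both players' unilateral deviations, which is all a 2x2 game requires; the composite containers at the end only gesture at the sequential/parallel/nested organization the abstract describes, not its exact prompt format.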
DOI: 10.48550/arxiv.2410.10479