GameArena: Evaluating LLM Reasoning through Live Computer Games

Evaluating the reasoning abilities of large language models (LLMs) is challenging. Existing benchmarks often depend on static datasets, which are vulnerable to data contamination and may get saturated over time, or on binary live human feedback that conflates reasoning with other abilities. As the m...

Bibliographic Details
Published in:	arXiv.org 2024-12
Main authors:	Hu, Lanxiang; Li, Qiyu; Xie, Anze; Jiang, Nan; Stoica, Ion; Jin, Haojian; Zhang, Hao
Format:	Article
Language:	English
Subjects:	Benchmarks; Chatbots; Computer & video games; Evaluation; Human performance; Large language models; Reasoning
Online access:	Full text