Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

Benchmarks are critical for measuring progress of math reasoning abilities of Large Language Models (LLMs). However, existing widely-used benchmarks such as GSM8K have been rendered less useful as multiple cutting-edge LLMs achieve over 94% accuracy. While harder benchmarks have been proposed, their...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Miner, Stephen, Takashima, Yoshiki, Han, Simeng, Erata, Ferhat, Antonopoulos, Timos, Piskac, Ruzica, Shapiro, Scott J
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!