Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
Format: Article
Language: English
Online access: Order full text
Summary: Benchmarks are critical for measuring progress in the math reasoning abilities of Large Language Models (LLMs). However, existing widely used benchmarks such as GSM8K have been rendered less useful as multiple cutting-edge LLMs achieve over 94% accuracy. While harder benchmarks have been proposed, their creation is often manual and expensive. We present Scheherazade, an automated approach for producing challenging mathematical reasoning benchmarks by logically chaining mathematical reasoning problems. We propose two chaining methods, forward chaining and backward chaining, which require reasoning forward and backward through the chain, respectively. We apply Scheherazade to GSM8K to create GSM8K-Scheherazade and evaluate 3 frontier LLMs and OpenAI's o1-preview on it. We show that while the frontier models' performance declines precipitously when only a few questions are chained, a preliminary evaluation suggests that o1-preview's performance persists for up to 5 backward-chained questions. In addition, while all other models perform worse when problems are chained backward, o1-preview performs better on backward-chained benchmarks. We will release the dataset and code publicly.
DOI: 10.48550/arxiv.2410.00151
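The summary describes chaining GSM8K-style problems so that solving one question requires the answer to another. The sketch below is only a minimal illustration of what forward chaining of two word problems might look like; the problem texts, the placeholder convention, and the `forward_chain` helper are hypothetical and not the paper's released code, which may construct chains differently.

```python
# Illustrative sketch (not the paper's implementation): forward-chain two
# GSM8K-style word problems so that the answer to problem 1 is needed as a
# quantity in problem 2, forcing a solver to reason front to back.

def forward_chain(problem1: dict, problem2_template: str) -> str:
    """Splice problem 2 onto problem 1, referring to problem 1's answer as X."""
    return (
        problem1["question"]
        + " Let X be the answer to the previous question. "
        + problem2_template.replace("{X}", "X")
    )

# Hypothetical example problems.
p1 = {
    "question": "Ali has 4 boxes with 6 apples each. How many apples does Ali have?",
    "answer": 24,
}
p2_template = "Beth has {X} apples and gives away 10. How many apples does Beth keep?"

print(forward_chain(p1, p2_template))
# A backward-chained variant would instead reveal a quantity about the last
# problem and require reasoning back through the chain to earlier problems.
```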