modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Main authors: , , , , , , ,
Format: Article
Language: eng
Online access: Order full text
Abstract: We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles that tests few-shot reasoning in AI systems. Solving these puzzles requires inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they demand compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this mitigates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open-source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability that cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
DOI: 10.48550/arxiv.2406.17038