Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models
Main authors: | , , , , , , , , , , , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | The ability to build and leverage world models is essential for a
general-purpose AI agent. Testing such capabilities is hard, in part because
the building blocks of world models are ill-defined. We present Elements of
World Knowledge (EWOK), a framework for evaluating world modeling in language
models by testing their ability to use knowledge of a concept to match a target
text with a plausible/implausible context. EWOK targets specific concepts from
multiple knowledge domains known to be vital for world modeling in humans.
Domains range from social interactions (help/hinder) to spatial relations
(left/right). Both contexts and targets are minimal pairs. Objects, agents,
and locations in the items can be flexibly filled in, enabling easy generation
of multiple controlled datasets. We then introduce EWOK-CORE-1.0, a dataset of
4,374 items covering 11 world knowledge domains. We evaluate 20 open-weight
large language models (1.3B–70B parameters) across a battery of evaluation
paradigms along with a human norming study comprising 12,480 measurements. The
overall performance of all tested models is worse than human performance, with
results varying drastically across domains. These data highlight simple cases
where even large models fail and present rich avenues for targeted research on
LLM world modeling capabilities. |
---|---|
DOI: | 10.48550/arxiv.2405.09605 |
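
The item design described in the summary (minimal-pair contexts and targets, with fillable agents and objects) can be illustrated with a short sketch. Everything named here is an assumption made for illustration and is not taken from the EWOK codebase: the `Item` class, the `make_item` and `item_correct` helpers, the template wording, and the scoring rule that counts an item as correct only when each target scores higher with its matching context than with the other context. The `toy_score` function is a hypothetical stand-in; an actual evaluation would replace it with a model-derived score.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Item:
    """One minimal-pair item: two contexts and two targets (hypothetical layout)."""
    context_1: str
    context_2: str
    target_1: str
    target_2: str


def make_item(fillers: Dict[str, str]) -> Item:
    """Fill illustrative spatial-relation templates with an object and an agent."""
    return Item(
        context_1="The {obj} is left of {agent}.".format(**fillers),
        context_2="The {obj} is right of {agent}.".format(**fillers),
        target_1="{agent} is right of the {obj}.".format(**fillers),
        target_2="{agent} is left of the {obj}.".format(**fillers),
    )


def item_correct(item: Item, score: Callable[[str, str], float]) -> bool:
    """Assumed rule: each target must score higher with its matching context."""
    return (
        score(item.context_1, item.target_1) > score(item.context_2, item.target_1)
        and score(item.context_2, item.target_2) > score(item.context_1, item.target_2)
    )


def toy_score(context: str, target: str) -> float:
    """Hypothetical stand-in for a model score.

    Returns 1.0 when the target states the converse relation of the context,
    and 0.0 otherwise.
    """
    converse = {"left": "right", "right": "left"}
    for word, other in converse.items():
        if f" {word} " in context and f" {other} " in target:
            return 1.0
    return 0.0


item = make_item({"obj": "piano", "agent": "Ali"})
print(item_correct(item, toy_score))  # True
```

Because the fillers are passed in as a dictionary, the same templates yield a new controlled item for any object and agent pair, which mirrors the "flexibly filled in" property the summary describes.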