BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information
Format: Article
Language: English
Online access: Order full text
Abstract: Automated reasoning with unstructured natural text is a key requirement for many potential applications of NLP and for developing robust AI systems. Recently, Language Models (LMs) have demonstrated complex reasoning capacities even without any finetuning. However, existing evaluations of automated reasoning assume access to a consistent and coherent set of information over which models reason. When reasoning in the real world, the available information is frequently inconsistent or contradictory, so models need a strategy for resolving such conflicts when they arise. One widely applicable way of resolving conflicts is to impose preferences over information sources (e.g., based on source credibility or information recency) and adopt the source with higher preference. In this paper, we formulate the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and develop a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA also incorporates reasoning with implicit background knowledge, to better reflect reasoning problems in downstream applications. We benchmark various LMs on BoardgameQA, and the results reveal a significant gap in the reasoning capacity of state-of-the-art LMs on this problem, showing that reasoning with conflicting information does not surface out-of-the-box in LMs. While performance can be improved with finetuning, it nevertheless remains poor.
DOI: 10.48550/arxiv.2306.07934
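
The preference-based conflict resolution described in the abstract can be illustrated with a minimal sketch. This is not code from the paper: the Claim structure, its field names, and the preference scores below are invented for illustration, and BoardgameQA itself poses these problems in natural language rather than as structured records. The sketch shows only the resolution rule: when two sources assert contradictory verdicts about the same proposition, the higher-preference source wins.

```python
# Illustrative toy (not from the paper): resolving a contradiction
# between two sources by preferring the more credible / more recent one.
# All names and preference values here are hypothetical.

from dataclasses import dataclass

@dataclass
class Claim:
    statement: str    # a proposition, e.g. "the cat attacks the mouse"
    truth: bool       # whether the source asserts or denies it
    source: str       # where the claim comes from
    preference: int   # higher = more trusted (e.g. credibility, recency)

def resolve(claims: list[Claim]) -> dict[str, bool]:
    """For each proposition, keep the verdict of the highest-preference claim."""
    best: dict[str, Claim] = {}
    for c in claims:
        cur = best.get(c.statement)
        if cur is None or c.preference > cur.preference:
            best[c.statement] = c
    return {s: c.truth for s, c in best.items()}

claims = [
    Claim("the cat attacks the mouse", True,  "rule_A", preference=1),
    Claim("the cat attacks the mouse", False, "rule_B", preference=2),
]
print(resolve(claims))  # {'the cat attacks the mouse': False} -- rule_B wins
```

A full defeasible reasoner would additionally chain rules and handle implicit background knowledge, as the dataset requires; this toy resolves only direct, pairwise contradictions.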