Language agents achieve superhuman synthesis of scientific knowledge
Main Authors: , , , , , , , ,
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Language models are known to hallucinate incorrect information, and it is
unclear if they are sufficiently accurate and reliable for use in scientific
research. We developed a rigorous human-AI comparison methodology to evaluate
language model agents on real-world literature search tasks covering
information retrieval, summarization, and contradiction detection tasks. We
show that PaperQA2, a frontier language model agent optimized for improved
factuality, matches or exceeds subject matter expert performance on three
realistic literature research tasks without any restrictions on humans (i.e.,
full access to internet, search tools, and time). PaperQA2 writes cited,
Wikipedia-style summaries of scientific topics that are significantly more
accurate than existing, human-written Wikipedia articles. We also introduce a
hard benchmark for scientific literature research called LitQA2 that guided the
design of PaperQA2, leading it to exceed human performance. Finally, we
apply PaperQA2 to identify contradictions within the scientific literature, an
important scientific task that is challenging for humans. PaperQA2 identifies
2.34 +/- 1.99 contradictions per paper in a random subset of biology papers, of
which 70% are validated by human experts. These results demonstrate that
language model agents are now capable of exceeding domain experts across
meaningful tasks on scientific literature.
DOI: 10.48550/arxiv.2409.13740