OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Format: Article
Language: English
Abstract: Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, data, and a public demo.
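The abstract describes a retrieve-then-synthesize pipeline with a self-feedback inference loop. The sketch below is a minimal, hypothetical illustration of that pattern, not OpenScholar's actual implementation: the word-overlap retriever, the string-stitching "generator", and the citation check (the names toy_retrieve, toy_generate, and answer_with_feedback are invented here) stand in for the trained retriever over 45 million papers, the 8B LM, and the model-generated feedback described in the paper.

```python
from collections import Counter

# Toy corpus standing in for OpenScholar's 45M-passage datastore.
PASSAGES = {
    "p1": "Retrieval-augmented LMs ground generated answers in retrieved passages.",
    "p2": "Citation accuracy measures whether each cited passage supports its claim.",
    "p3": "Self-feedback loops let a model critique and revise its own draft.",
}

def toy_retrieve(query: str, k: int = 2) -> list[str]:
    """Rank passages by word overlap (stand-in for a trained dense retriever)."""
    q = Counter(query.lower().split())

    def score(text: str) -> int:
        return sum((q & Counter(text.lower().split())).values())

    return sorted(PASSAGES, key=lambda pid: score(PASSAGES[pid]), reverse=True)[:k]

def toy_generate(pids: list[str]) -> str:
    """Stand-in for the LM: stitch retrieved passages together with citation markers."""
    return " ".join(f"{PASSAGES[p].rstrip('.')} [{p}]." for p in pids)

def answer_with_feedback(query: str, rounds: int = 2) -> str:
    """Draft an answer, then iteratively self-check and re-retrieve."""
    pids = toy_retrieve(query)
    draft = toy_generate(pids)
    for _ in range(rounds):
        # A real system would ask the LM to critique its own draft; this toy
        # check only verifies that every sentence carries a citation marker.
        if all("[" in s for s in draft.split(".") if s.strip()):
            break
        pids = toy_retrieve(query + " " + draft)
        draft = toy_generate(pids)
    return draft

print(answer_with_feedback("How do retrieval-augmented LMs improve citation accuracy?"))
```

In this toy loop the feedback signal is a hard-coded citation check; the paper's contribution is that the feedback itself comes from the model and is used to refine both retrieval and the drafted response.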
DOI: 10.48550/arxiv.2411.14199