LAB-Bench: Measuring Capabilities of Language Models for Biology Research
Main Authors: , , , , , , , ,
Format: Article
Language: English
Subjects:
Online Access: Order full text
Summary: There is widespread optimism that frontier Large Language Models (LLMs) and
LLM-augmented systems have the potential to rapidly accelerate scientific
discovery across disciplines. Today, many benchmarks exist to measure LLM
knowledge and reasoning on textbook-style science questions, but few, if any,
benchmarks are designed to evaluate language model performance on practical
tasks required for scientific research, such as literature search, protocol
planning, and data analysis. As a step toward building such benchmarks, we
introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of
over 2,400 multiple choice questions for evaluating AI systems on a range of
practical biology research capabilities, including recall and reasoning over
literature, interpretation of figures, access and navigation of databases, and
comprehension and manipulation of DNA and protein sequences. Importantly, in
contrast to previous scientific benchmarks, we expect that an AI system that
can achieve consistently high scores on the more difficult LAB-Bench tasks
would serve as a useful assistant for researchers in areas such as literature
search and molecular cloning. As an initial assessment of the emergent
scientific task capabilities of frontier language models, we measure
the performance of several such models against our benchmark and report results compared to
human expert biology researchers. We will continue to update and expand
LAB-Bench over time, and expect it to serve as a useful tool in the development
of automated research systems going forward. A public subset of LAB-Bench is
available for use at the following URL:
https://huggingface.co/datasets/futurehouse/lab-bench
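
The public subset can be loaded directly with the Hugging Face `datasets` library. Below is a minimal sketch of loading one task configuration and scoring a model on its multiple-choice questions. The config name `LitQA2` and the column names `question`, `ideal`, and `distractors` are assumptions about the dataset's layout rather than details confirmed by this record, and `answer_with_model` is a hypothetical stand-in (here a random guesser) for a real LLM call:

```python
import random

from datasets import load_dataset

# Assumption: the public subset exposes per-task configs such as "LitQA2"
# with "question", "ideal" (correct answer), and "distractors" columns.
ds = load_dataset("futurehouse/lab-bench", "LitQA2", split="train")


def answer_with_model(question: str, choices: list[str]) -> str:
    """Hypothetical stand-in for a real LLM call; guesses uniformly at random."""
    return random.choice(choices)


correct = 0
for row in ds:
    # Shuffle the correct answer in among the distractors before asking.
    choices = [row["ideal"]] + list(row["distractors"])
    random.shuffle(choices)
    if answer_with_model(row["question"], choices) == row["ideal"]:
        correct += 1

print(f"Accuracy: {correct / len(ds):.3f}")
```

With the random guesser above, accuracy should land near 1/k for k answer choices, which gives a floor to compare against the model and human-expert scores the paper reports.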
DOI: 10.48550/arxiv.2407.10362