Large Language Models in the Clinic: A Comprehensive Benchmark
Format: Article
Language: English
Online access: Order full text
Abstract: The adoption of large language models (LLMs) to assist clinicians has attracted considerable attention. Existing work mainly evaluates LLMs on closed-ended question-answering (QA) tasks with pre-set answer options. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark, ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. We then construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long-document processing, and emerging-drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.
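The abstract contrasts zero-shot evaluation (no worked examples in the prompt) with few-shot evaluation (a handful of question-answer exemplars prepended) for open-ended clinical QA. The sketch below illustrates how such prompts might be assembled; it is a minimal illustration, not the paper's evaluation harness, and the example question, exemplars, and `query_llm` stub are hypothetical rather than taken from the ClinicBench repository.

```python
# Illustrative zero-shot vs. few-shot prompt construction for open-ended
# clinical QA. All question text and the query_llm stub are hypothetical;
# they do not come from the ClinicBench benchmark files.

from typing import Callable, Sequence, Tuple


def build_prompt(question: str,
                 exemplars: Sequence[Tuple[str, str]] = ()) -> str:
    """Assemble a prompt; an empty exemplar list yields a zero-shot prompt."""
    parts = ["You are a clinical assistant. Answer the question concisely."]
    for q, a in exemplars:  # few-shot: prepend worked examples
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def evaluate(questions: Sequence[str],
             query_llm: Callable[[str], str],
             exemplars: Sequence[Tuple[str, str]] = ()) -> list:
    """Run each question through the model and collect free-text answers."""
    return [query_llm(build_prompt(q, exemplars)) for q in questions]


if __name__ == "__main__":
    # Stub model for demonstration; swap in a real LLM call here.
    echo_model = lambda prompt: "<model answer>"

    # Zero-shot: the model sees only the instruction and the question.
    print(build_prompt("A patient presents with chest pain and dyspnea. "
                       "What is the next diagnostic step?"))

    # Few-shot: one hypothetical exemplar precedes the target question.
    shots = [("What is the first-line treatment for uncomplicated hypertension?",
              "A thiazide diuretic, ACE inhibitor, ARB, or calcium channel "
              "blocker, chosen per patient factors.")]
    print(evaluate(["Which antibiotic class is typically avoided in "
                    "pregnancy?"], echo_model, shots))
```

Because the answers are free text rather than option letters, scoring such open-ended outputs requires generation metrics or expert judgment, which is consistent with the expert evaluation the abstract describes.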
DOI: 10.48550/arxiv.2405.00716