MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework
Format: Article
Language: English
Online access: Order full text
Abstract: Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We introduce MedQA-CS, an AI-SCE framework inspired by medical education's Objective Structured Clinical Examinations (OSCEs), to address this gap. MedQA-CS evaluates LLMs through two instruction-following tasks, LLM-as-medical-student and LLM-as-CS-examiner, designed to reflect real clinical scenarios. Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations, and providing a quantitative and qualitative assessment of LLMs as reliable judges in CS evaluation. Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks (e.g., MedQA). Combined with existing benchmarks, MedQA-CS enables a more comprehensive evaluation of the clinical capabilities of both open- and closed-source LLMs.
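The abstract describes a two-stage evaluation: a candidate model acts as the medical student and a judge model acts as the CS examiner. The sketch below is not from the paper; it only illustrates how such a pipeline could be wired up, assuming a hypothetical query_llm helper and hypothetical case fields (instruction, encounter, checklist).

```python
# Hypothetical sketch of a MedQA-CS-style two-stage evaluation loop.
# query_llm is an assumed placeholder for whatever LLM API you use;
# it is not part of the MedQA-CS release.

def query_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("Connect this to your LLM provider.")

def evaluate_case(case: dict, student_model: str, examiner_model: str) -> dict:
    # Stage 1: LLM-as-medical-student -- the candidate model responds to an
    # OSCE-style clinical instruction for this encounter.
    student_response = query_llm(
        student_model,
        f"{case['instruction']}\n\nPatient encounter:\n{case['encounter']}",
    )

    # Stage 2: LLM-as-CS-examiner -- a judge model grades the response
    # against the expert-written checklist for the station.
    examiner_verdict = query_llm(
        examiner_model,
        "You are an OSCE examiner. Score the candidate response against this "
        f"checklist:\n{case['checklist']}\n\nCandidate response:\n{student_response}",
    )
    return {"response": student_response, "verdict": examiner_verdict}
```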
DOI: 10.48550/arxiv.2410.01553