MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models
Saved in:
Main authors: | , , , , , , , , , , , , , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: Ensuring the general efficacy and benefit for human beings of medical large language models (LLMs) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, has yet to be established. In this work, we introduce "MedBench", a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the currently largest evaluation dataset (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separation of questions and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general-purpose and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals' perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.
DOI: 10.48550/arxiv.2407.10990