RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning
Format: Article
Language: English
Abstract: Recent advancements in Large Language Models (LLMs) and Large Multi-modal
Models (LMMs) have shown potential in various medical applications, such as
Intelligent Medical Diagnosis. Although impressive results have been achieved,
we find that existing benchmarks do not reflect the complexity of real medical
reports or the specialized, in-depth reasoning they require. In this work, we
introduce RJUA-MedDQA, a comprehensive benchmark in the field of medical
specialization, which poses several challenges: comprehensively interpreting
image content across diverse and challenging layouts, applying numerical
reasoning to identify abnormal indicators, and demonstrating clinical
reasoning to provide statements of disease diagnosis, status, and advice
based on medical contexts. We carefully design the data generation pipeline and
propose the Efficient Structural Restoration Annotation (ESRA) method, aimed
at restoring textual and tabular content in medical report images. This method
substantially enhances annotation efficiency, doubling the productivity of each
annotator, and yields a 26.8% improvement in accuracy. We conduct extensive
evaluations, including few-shot assessments of five LMMs capable of
solving Chinese medical QA tasks. To further investigate the limitations and
potential of current LMMs, we conduct comparative experiments on a set of
strong LLMs using the image text generated by the ESRA method. We report the
performance of baselines and offer several observations: (1) the overall
performance of existing LMMs is still limited, although LMMs are more robust to
low-quality and diverse-structured images than LLMs; (2) reasoning
across context and image content presents significant challenges. We hope this
benchmark helps the community make progress on these challenging tasks in
multi-modal medical document understanding and facilitates its application in
healthcare.
DOI: 10.48550/arxiv.2402.14840