ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical Dialogue
Main authors: | , , , , , , , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | The rapid rise of large language models (LLMs) in recent years has
boosted the prevalence of vision-language models (VLMs) in the medical sector.
In our online medical consultation scenario, a doctor responds to the texts and
images provided by a patient in multiple rounds to diagnose her/his health
condition, forming a multi-turn multimodal medical dialogue format. Unlike
high-quality images captured by professional equipment in traditional medical
visual question answering (Med-VQA), the images in our case are taken by
patients' mobile phones. These images are poorly quality-controlled, with issues
such as excessive background elements and the lesion area being significantly
off-center, leading to degradation of vision-language alignment in the model
training phase. In this paper, we propose ZALM3, a Zero-shot strategy to
improve vision-language ALignment in Multi-turn Multimodal Medical dialogue.
Since we observe that the text conversation preceding an image can be used to
infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to
summarize the keywords from the preceding context and a visual grounding model
to extract the RoIs. The updated images eliminate unnecessary background noise
and provide more effective vision-language alignment. To better evaluate our
proposed method, we design a new subjective assessment metric for multi-turn
unimodal/multimodal medical dialogue to provide a fine-grained performance
comparison. Our experiments across three different clinical departments
demonstrate the efficacy of ZALM3 with statistical significance. |
---|---|
DOI: | 10.48550/arxiv.2409.17610 |
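
The pipeline outlined in the summary (keywords from the preceding context, then visual grounding, then an image restricted to the RoIs) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `summarize` and `ground` are hypothetical callables standing in for the LLM and the visual grounding model, and the union-of-boxes crop, the `margin` parameter, and the fallback to the original image are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom); right/bottom exclusive
Image = List[List[int]]           # toy image: rows of pixel values


def union_box(boxes: Sequence[Box]) -> Box:
    """Smallest box that contains every box in `boxes`."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return min(lefts), min(tops), max(rights), max(bottoms)


def crop_to_roi(
    image: Image,
    keywords: List[str],
    ground: Callable[[Image, List[str]], List[Box]],
    margin: int = 0,
) -> Image:
    """Crop `image` to the union of the boxes returned by `ground`.

    `ground` is a hypothetical stand-in for a visual grounding model.
    Assumption: if nothing is grounded, the original image is kept.
    """
    height, width = len(image), len(image[0])
    boxes = ground(image, keywords)
    if not boxes:
        return image
    left, top, right, bottom = union_box(boxes)
    # Expand by the margin, clamped to the image bounds.
    left, top = max(0, left - margin), max(0, top - margin)
    right, bottom = min(width, right + margin), min(height, bottom + margin)
    return [row[left:right] for row in image[top:bottom]]


def refine_image(
    context_turns: List[str],
    image: Image,
    summarize: Callable[[List[str]], List[str]],
    ground: Callable[[Image, List[str]], List[Box]],
    margin: int = 0,
) -> Image:
    """Keywords from the preceding dialogue turns drive the RoI crop.

    `summarize` is a hypothetical stand-in for the LLM keyword step.
    """
    keywords = summarize(context_turns)
    return crop_to_roi(image, keywords, ground, margin)


# Toy usage with stub models (no real LLM or grounding model involved).
toy_image = [[r * 10 + c for c in range(6)] for r in range(4)]
stub_summarize = lambda turns: ["rash"]
stub_ground = lambda img, kws: [(2, 1, 4, 3)] if kws else []
cropped = refine_image(["I have a rash on my arm."], toy_image,
                       stub_summarize, stub_ground)
```

Passing the two models in as callables keeps the sketch independent of any particular LLM or grounding library; the cropping step is the only part that is fully specified here.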