MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context
Large Vision Language Models (LVLMs) have recently achieved superior performance on a wide range of natural image and text tasks, which has inspired a large number of studies on LVLM fine-tuning and training. Despite these advances, there has been little research on the robustness of these models against hallucination...
Saved in:
Main Authors: | Gu, Zishan ; Yin, Changchang ; Liu, Fenglin ; Zhang, Ping |
---|---|
Format: | Article |
Language: | English (eng) |
Subjects: | Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition |
Online Access: | Order full text |
creator | Gu, Zishan ; Yin, Changchang ; Liu, Fenglin ; Zhang, Ping |
description | Large Vision Language Models (LVLMs) have recently achieved superior
performance on a wide range of natural image and text tasks, which has inspired
a large number of studies on LVLM fine-tuning and training. Despite these
advances, there has been little research on the robustness of such models
against hallucination when they are fine-tuned on smaller datasets. In this
study, we introduce a new benchmark dataset, the Medical Visual Hallucination
Test (MedVH), to evaluate hallucination in domain-specific LVLMs. MedVH
comprises five tasks that probe hallucination in LVLMs within the medical
context, covering comprehensive understanding of textual and visual input as
well as long-form text generation. Our extensive experiments with both general
and medical LVLMs reveal that, although medical LVLMs achieve promising
performance on standard medical tasks, they are particularly susceptible to
hallucination, often more so than the general models, raising significant
concerns about the reliability of these domain-specific models. For medical
LVLMs to be truly valuable in real-world applications, they must not only
accurately integrate medical knowledge but also maintain robust reasoning
abilities to prevent hallucination. Our work paves the way for future
evaluations of such models. |
doi_str_mv | 10.48550/arxiv.2407.02730 |
format | Article |
fullrecord | arXiv preprint record 2407.02730 (Open Access Repository), created 2024-07-02; rights: http://creativecommons.org/licenses/by/4.0 (free to read); full text: https://arxiv.org/abs/2407.02730 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2407.02730 |
language | eng |
recordid | cdi_arxiv_primary_2407_02730 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition |
title | MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T10%3A42%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MedVH:%20Towards%20Systematic%20Evaluation%20of%20Hallucination%20for%20Large%20Vision%20Language%20Models%20in%20the%20Medical%20Context&rft.au=Gu,%20Zishan&rft.date=2024-07-02&rft_id=info:doi/10.48550/arxiv.2407.02730&rft_dat=%3Carxiv_GOX%3E2407_02730%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |