MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context


Saved in:
Bibliographic Details
Main authors: Gu, Zishan, Yin, Changchang, Liu, Fenglin, Zhang, Ping
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Gu, Zishan; Yin, Changchang; Liu, Fenglin; Zhang, Ping
description Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which has inspired a large number of studies on LVLM fine-tuning and training. Despite these advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate hallucination in domain-specific LVLMs. MedVH comprises five tasks that evaluate hallucinations in LVLMs within the medical context, covering comprehensive understanding of textual and visual input as well as long-form textual response generation. Our extensive experiments with both general and medical LVLMs reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than the general models, raising significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Our work paves the way for future evaluations of these studies.
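
For illustration only, and not part of the MedVH benchmark or any code released with the paper: a minimal Python sketch of how a multiple-choice hallucination probe of the kind the abstract describes might be scored. The item schema, the `query_lvlm` client, and the "hallucination-bait" distractor are hypothetical placeholders standing in for whatever LVLM interface and benchmark format are actually used.

```python
# Hypothetical sketch of scoring one multiple-choice hallucination item.
# The model sees a medical image plus a question whose options include a
# deliberately wrong "bait" choice; it is scored both on accuracy and on
# whether it fell for the bait. `query_lvlm` is a stand-in for a real
# LVLM API call and must be replaced by the user.

from dataclasses import dataclass


@dataclass
class MCItem:
    image_path: str      # path to the medical image
    question: str        # textual question about the image
    options: list[str]   # candidate answers, one of which is correct
    correct_idx: int     # index of the ground-truth answer
    bait_idx: int        # index of the hallucination-bait distractor


def query_lvlm(image_path: str, prompt: str) -> str:
    """Placeholder for the model call; replace with a real LVLM client."""
    raise NotImplementedError


def score_item(item: MCItem) -> dict:
    # Format the options as (A), (B), ... and ask for a single letter.
    prompt = (
        item.question
        + "\nOptions:\n"
        + "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(item.options))
        + "\nAnswer with a single option letter."
    )
    reply = query_lvlm(item.image_path, prompt).strip().upper()
    picked = ord(reply[0]) - 65 if reply and reply[0].isalpha() else -1
    return {
        "correct": picked == item.correct_idx,
        "hallucinated": picked == item.bait_idx,  # chose the bait distractor
    }
```

Aggregating `correct` and `hallucinated` over a set of such items would give an accuracy score alongside a hallucination rate, which is one plausible way a benchmark like MedVH could compare general and medical LVLMs; the actual tasks and metrics are defined in the paper itself.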
doi_str_mv 10.48550/arxiv.2407.02730
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2407.02730
language eng
recordid cdi_arxiv_primary_2407_02730
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
title MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T10%3A42%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MedVH:%20Towards%20Systematic%20Evaluation%20of%20Hallucination%20for%20Large%20Vision%20Language%20Models%20in%20the%20Medical%20Context&rft.au=Gu,%20Zishan&rft.date=2024-07-02&rft_id=info:doi/10.48550/arxiv.2407.02730&rft_dat=%3Carxiv_GOX%3E2407_02730%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true