PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset

Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.
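As a rough illustration of how a VQA-triplet benchmark like PhD might be scored per mode, the sketch below groups items by their evaluation mode and computes accuracy. All field and function names here are illustrative assumptions, not the released PhD data format:

```python
from dataclasses import dataclass
from collections import defaultdict

# Assumed record layout (hypothetical): each item pairs an image with a
# yes/no question, tagged by task (object, attribute, sentiment, position,
# counting) and mode (base, iac, icc, ccs).
@dataclass(frozen=True)
class VQATriplet:
    image_id: str
    question: str
    answer: str  # ground-truth "yes" / "no"
    task: str
    mode: str

def accuracy_by_mode(triplets, predictions):
    """Per-mode accuracy; predictions maps (image_id, question) -> answer."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t in triplets:
        totals[t.mode] += 1
        if predictions.get((t.image_id, t.question)) == t.answer:
            hits[t.mode] += 1
    return {m: hits[m] / totals[m] for m in totals}

# Toy example with two items in different modes
data = [
    VQATriplet("img1", "Is there a dog?", "yes", "object", "base"),
    VQATriplet("img2", "Is the sky green?", "no", "attribute", "icc"),
]
preds = {("img1", "Is there a dog?"): "yes",
         ("img2", "Is the sky green?"): "yes"}
print(accuracy_by_mode(data, preds))  # {'base': 1.0, 'icc': 0.0}
```

Comparing per-mode scores in this way is what surfaces the variability the abstract mentions: a model may do well in PhD-base yet degrade sharply when the question carries inaccurate or incorrect context.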

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Liu, Jiazhen, Fu, Yuhan, Xie, Ruobing, Xie, Runquan, Sun, Xingwu, Lian, Fengzong, Kang, Zhanhui, Li, Xirong
Format: Article
Language: eng
Subjects:
Online Access: Order full text
description Multimodal Large Language Models (MLLMs) hallucinate, resulting in an emerging topic of visual hallucination evaluation (VHE). This paper contributes a ChatGPT-Prompted visual hallucination evaluation Dataset (PhD) for objective VHE at a large scale. The essence of VHE is to ask an MLLM questions about specific images to assess its susceptibility to hallucination. Depending on what to ask (objects, attributes, sentiment, etc.) and how the questions are asked, we structure PhD along two dimensions, i.e., task and mode. Five visual recognition tasks, ranging from low-level (object / attribute recognition) to middle-level (sentiment / position recognition and counting), are considered. Besides a normal visual QA mode, which we term PhD-base, PhD also asks questions with inaccurate context (PhD-iac) or with incorrect context (PhD-icc), or with AI-generated counter common sense images (PhD-ccs). We construct PhD by a ChatGPT-assisted semi-automated pipeline, encompassing four pivotal modules: task-specific hallucinatory item (hitem) selection, hitem-embedded question generation, inaccurate / incorrect context generation, and counter-common-sense (CCS) image generation. With over 14k daily images, 750 CCS images and 102k VQA triplets in total, PhD reveals considerable variability in MLLMs' performance across various modes and tasks, offering valuable insights into the nature of hallucination. As such, PhD stands as a potent tool not only for VHE but may also play a significant role in the refinement of MLLMs.
doi_str_mv 10.48550/arxiv.2403.11116
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2403.11116
language eng
recordid cdi_arxiv_primary_2403_11116
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
title PhD: A ChatGPT-Prompted Visual hallucination Evaluation Dataset