CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models
Main authors: | , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Multimodal foundation models are prone to hallucination, generating outputs
that either contradict the input or are not grounded by factual information.
Given the diversity in architectures, training data and instruction tuning
techniques, there can be large variations in systems' susceptibility to
hallucinations. To assess system hallucination robustness, hallucination
ranking approaches have been developed for specific tasks such as image
captioning, question answering, summarization, or biography generation.
However, these approaches typically compare model outputs to gold-standard
references or labels, limiting hallucination benchmarking for new domains. This
work proposes "CrossCheckGPT", a reference-free universal hallucination ranking
for multimodal foundation models. The core idea of CrossCheckGPT is that the
same hallucinated content is unlikely to be generated by different independent
systems, hence cross-system consistency can provide meaningful and accurate
hallucination assessment scores. CrossCheckGPT can be applied to any model or
task, provided that the information consistency between outputs can be measured
through an appropriate distance metric. Focusing on multimodal large language
models that generate text, we explore two information consistency measures:
CrossCheck-explicit and CrossCheck-implicit. We showcase the applicability of
our method for hallucination ranking across various modalities, namely the
text, image, and audio-visual domains. Further, we propose the first
audio-visual hallucination benchmark, "AVHalluBench", and illustrate the
effectiveness of CrossCheckGPT, achieving correlations of 98% and 89% with
human judgements on MHaluBench and AVHalluBench, respectively. |
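
The cross-system consistency idea described in the abstract can be illustrated with a short sketch. The code below is not the paper's implementation: it assumes a token-level Jaccard overlap as a stand-in for the information-consistency metric (the actual CrossCheck-explicit and CrossCheck-implicit measures are more sophisticated), and the function names `crosscheck_score` and `rank_systems` are hypothetical.

```python
def jaccard_similarity(a: str, b: str) -> float:
    # Token-level Jaccard overlap; a simple stand-in for the paper's
    # information-consistency metric (assumption, not the paper's choice).
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def crosscheck_score(target_output: str, other_outputs: list[str]) -> float:
    # Hallucination score in [0, 1]: content that independent systems do not
    # reproduce yields low cross-system consistency, hence a high score.
    consistency = sum(
        jaccard_similarity(target_output, o) for o in other_outputs
    ) / len(other_outputs)
    return 1.0 - consistency


def rank_systems(outputs_by_system: dict[str, list[str]]) -> list[tuple[str, float]]:
    # Rank systems by their mean hallucination score over a shared prompt set;
    # lower scores (more cross-system consistency) rank first.
    names = list(outputs_by_system)
    n_prompts = len(next(iter(outputs_by_system.values())))
    scores = {}
    for name in names:
        per_prompt = [
            crosscheck_score(
                outputs_by_system[name][i],
                [outputs_by_system[m][i] for m in names if m != name],
            )
            for i in range(n_prompts)
        ]
        scores[name] = sum(per_prompt) / n_prompts
    return sorted(scores.items(), key=lambda kv: kv[1])


# Toy demo with made-up outputs from three hypothetical systems:
outs = {
    "model_a": ["the cat sat on the mat", "paris is the capital of france"],
    "model_b": ["a cat sat on a mat", "paris is france's capital"],
    "model_c": ["the dog flew to the moon", "berlin is the capital of france"],
}
print(rank_systems(outs))  # model_c ranks last (least consistent with the others)
```

Because the ranking is reference-free, no gold-standard labels appear anywhere above; only the models' own outputs on a shared prompt set are compared.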
DOI: | 10.48550/arxiv.2405.13684 |