VLind-Bench: Measuring Language Priors in Large Vision-Language Models
Main authors: , , , , , ,
Format: Article
Language: eng
Abstract: Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.
DOI: 10.48550/arxiv.2406.08702
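The gated evaluation described in the abstract can be illustrated with a minimal sketch: a language-prior score is computed only over instances whose prerequisite tests (commonsense knowledge, visual perception, commonsense bias) all pass, so that failures on the counterfactual test are not confounded by missing basic capabilities. The data structure and function names below are hypothetical illustrations, not the benchmark's actual schema or official evaluation code.

```python
# Hedged sketch of per-instance gated scoring in the spirit of VLind-Bench.
# Field and function names are hypothetical, not the benchmark's real API.
from dataclasses import dataclass


@dataclass
class InstanceResult:
    commonsense_knowledge: bool  # passes the text-only commonsense test
    visual_perception: bool      # correctly perceives the counterfactual image
    commonsense_bias: bool       # not misled by commonsense bias alone
    counterfactual: bool         # answers correctly given the counterfactual image


def language_prior_score(results: list[InstanceResult]) -> float:
    """Score only instances whose prerequisite tests all pass, so that
    counterfactual failures can be attributed to language priors."""
    eligible = [
        r for r in results
        if r.commonsense_knowledge and r.visual_perception and r.commonsense_bias
    ]
    if not eligible:
        return 0.0
    return sum(r.counterfactual for r in eligible) / len(eligible)
```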