Uncertainty-Aware Evaluation for Vision-Language Models
Main authors: , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Vision-Language Models (VLMs) like GPT-4, LLaVA, and CogVLM have recently surged in popularity due to their impressive performance on several vision-language tasks. Current evaluation methods, however, overlook an essential component: uncertainty, which is crucial for a comprehensive assessment of VLMs. Addressing this oversight, we present a benchmark that incorporates uncertainty quantification into the evaluation of VLMs. Our analysis spans more than 20 VLMs, focusing on the multiple-choice Visual Question Answering (VQA) task. We examine the models on five datasets that evaluate various vision-language capabilities. Using conformal prediction as an uncertainty estimation approach, we demonstrate that the models' uncertainty is not aligned with their accuracy. Specifically, we show that the models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring uncertainty for VLMs. Our empirical findings also reveal a correlation between a model's uncertainty and its language model component.
DOI: 10.48550/arxiv.2402.14418
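
The abstract names conformal prediction as the uncertainty estimation approach. As a minimal sketch of how this works in a multiple-choice setting such as VQA: a held-out calibration split sets a score threshold, and each test question then receives a set of answer choices whose average size reflects the model's uncertainty. The nonconformity score used here (one minus the probability assigned to the correct answer), the function name `conformal_prediction_sets`, and the toy data are illustrative assumptions, not the benchmark's actual protocol.

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets for multiple-choice questions.

    cal_probs:  (n_cal, n_choices) model probabilities on a calibration split
    cal_labels: (n_cal,) indices of the correct answer choices
    test_probs: (n_test, n_choices) model probabilities on test questions
    alpha:      target miscoverage rate (0.1 -> 90% coverage)
    Returns a boolean (n_test, n_choices) mask; True marks choices in the set.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true answer.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    # Keep every answer choice whose score stays within the threshold.
    return (1.0 - test_probs) <= qhat

# Toy demo with 4 answer choices (synthetic data, not a real VLM): the mean
# prediction-set size is the uncertainty measure; larger sets = less certain.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(4), size=500)
cal_labels = cal_probs.argmax(axis=1)            # stand-in "correct" answers
test_probs = rng.dirichlet(np.ones(4), size=100)
sets = conformal_prediction_sets(cal_probs, cal_labels, test_probs)
print("mean prediction-set size:", sets.sum(axis=1).mean())
```

Under the usual exchangeability assumption, these sets contain the correct answer with probability at least 1 - alpha, so a model can reach high top-1 accuracy while still producing large sets; that is the kind of accuracy/uncertainty mismatch the abstract reports.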