FiVL: A Framework for Improved Vision-Language Alignment
Format: Article
Language: English
Abstract: Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. This issue extends to vision-language benchmarks, where it is difficult to make the image indispensable for accurate answer generation, particularly in vision question-answering tasks. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. These datasets can be utilized both for training and for assessing an LVLM's ability to use image content as substantive evidence rather than relying solely on linguistic priors, providing insights into the model's reliance on visual information. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines, alongside a validation method and an application for explainability. The code is available at https://github.com/IntelLabs/fivl.
DOI: 10.48550/arxiv.2412.14672
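
The abstract describes assessing whether an LVLM uses image content as substantive evidence rather than relying on linguistic priors. As an illustration only, and not the FiVL method itself, the sketch below shows one generic way such reliance could be probed: comparing a model's answers on the real image against its answers on a blank image. The answer_fn callable, the blank_image placeholder, and the string-comparison criterion are all assumptions introduced here for illustration.

```python
from typing import Any, Callable, Sequence


def image_reliance_score(
    answer_fn: Callable[[str, Any], str],  # hypothetical: (question, image) -> answer text
    questions: Sequence[str],
    images: Sequence[Any],
    blank_image: Any,
) -> float:
    """Return the fraction of questions whose answer changes when the real
    image is replaced by a blank one. A score near 0 suggests the model is
    answering from linguistic priors rather than from visual evidence.
    """
    if not questions:
        return 0.0
    changed = 0
    for question, image in zip(questions, images):
        grounded = answer_fn(question, image)          # answer given the real image
        ungrounded = answer_fn(question, blank_image)  # answer with no useful visual input
        # Crude change criterion; a real evaluation would use a task-specific metric.
        if grounded.strip().lower() != ungrounded.strip().lower():
            changed += 1
    return changed / len(questions)
```

In this sketch, answer_fn would wrap whatever LVLM is being evaluated; FiVL's actual datasets and validation procedure are described in the paper and repository linked above.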