Towards Interpreting Visual Information Processing in Vision-Language Models
Main author(s): | |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | Vision-Language Models (VLMs) are powerful tools for processing and
understanding text and images. We study the processing of visual tokens in the
language model component of LLaVA, a prominent VLM. Our approach focuses on
analyzing the localization of object information, the evolution of visual token
representations across layers, and the mechanism of integrating visual
information for predictions. Through ablation studies, we demonstrated that
object identification accuracy drops by over 70% when object-specific tokens
are removed. We observed that visual token representations become increasingly
interpretable in the vocabulary space across layers, suggesting an alignment
with textual tokens corresponding to image content. Finally, we found that the
model extracts object information from these refined representations at the
last token position for prediction, mirroring the process in text-only language
models for factual association tasks. These findings provide crucial insights
into how VLMs process and integrate visual information, bridging the gap
between our understanding of language and vision models, and paving the way for
more interpretable and controllable multimodal systems. |
---|---|
DOI: | 10.48550/arxiv.2410.07149 |
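
The abstract mentions ablation studies in which removing object-specific visual tokens cuts object identification accuracy by over 70%. Below is a minimal sketch of that idea under stated assumptions: the helper arguments (`object_positions`, `visual_positions`) and the choice of replacement value (the mean of the remaining visual-token embeddings) are illustrative assumptions, not the paper's exact protocol. It should work with any Hugging Face causal LM that accepts `inputs_embeds`.

```python
# Sketch: ablate object-specific visual tokens before the language-model
# forward pass, then read out the next-token logits. Comparing the probability
# of the correct object name with and without this ablation approximates the
# accuracy-drop measurement described in the abstract.
import torch


@torch.no_grad()
def ablate_visual_tokens(language_model, inputs_embeds, attention_mask,
                         object_positions, visual_positions):
    """Overwrite the embeddings of object-covering visual tokens with the
    mean of the other visual-token embeddings (an assumed ablation scheme)."""
    embeds = inputs_embeds.clone()
    keep = [p for p in visual_positions if p not in set(object_positions)]
    mean_embed = embeds[:, keep, :].mean(dim=1, keepdim=True)
    embeds[:, object_positions, :] = mean_embed
    out = language_model(inputs_embeds=embeds, attention_mask=attention_mask)
    return out.logits[:, -1, :]  # next-token logits after ablation
```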
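The abstract's second finding, that visual token representations become increasingly interpretable in the vocabulary space across layers, can be probed with a logit-lens-style readout. The sketch below assumes a typical decoder-only LM interface (per-layer `hidden_states`, a final norm, and an unembedding head such as `lm_head`); these names are assumptions for illustration rather than LLaVA-specific code.

```python
# Sketch: project each layer's hidden states at the visual-token positions
# through the unembedding matrix and list the nearest vocabulary tokens,
# so one can inspect how the readout changes with depth.
import torch


@torch.no_grad()
def visual_tokens_to_vocab(hidden_states, visual_positions, final_norm,
                           unembed, tokenizer, top_k=5):
    """hidden_states: tuple of [batch, seq, d_model] tensors, one per layer,
    e.g. from model(..., output_hidden_states=True).hidden_states."""
    readouts = []
    for h in hidden_states:
        v = final_norm(h[:, visual_positions, :])    # [batch, n_vis, d_model]
        logits = unembed(v)                          # [batch, n_vis, vocab]
        top = logits.topk(top_k, dim=-1).indices[0]  # first batch element
        readouts.append([tokenizer.convert_ids_to_tokens(row.tolist())
                         for row in top])
    return readouts  # per layer: top-k vocabulary tokens per visual token
```

With a LLaVA-style checkpoint, `final_norm` and `unembed` would correspond to the language model's final normalization layer and its `lm_head`; the abstract's observation is that the retrieved tokens align more closely with the depicted image content in deeper layers.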