Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers
Format: Article
Language: English
Abstract: An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a "visual prompt," which is provided to the LLM along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use *diagnostic classifiers* to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when the resampler is kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the resamplers can, in principle, encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
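The probing setup the abstract describes can be illustrated with a minimal sketch: a small diagnostic classifier reads the resampler's visual prompt and predicts a spatial relation, and a switch decides whether the resampler stays frozen or is trained jointly with the probe. This is not the paper's actual code; the `SpatialProbe` and `make_optimizer` names, the mean-pooling over visual tokens, the four-way relation labels, and the optimizer settings are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialProbe(nn.Module):
    """Hypothetical diagnostic classifier: predicts a spatial relation
    (e.g. left-of / right-of / above / below) between a pair of objects
    from the visual prompt produced by a resampler."""

    def __init__(self, token_dim: int, num_relations: int = 4):
        super().__init__()
        self.classifier = nn.Linear(token_dim, num_relations)

    def forward(self, visual_prompt: torch.Tensor) -> torch.Tensor:
        # visual_prompt: (batch, num_tokens, token_dim) from the resampler.
        pooled = visual_prompt.mean(dim=1)   # pool over visual tokens (an assumption)
        return self.classifier(pooled)       # logits: (batch, num_relations)

def make_optimizer(resampler: nn.Module, probe: nn.Module, joint: bool):
    """Two probing regimes from the paper: resampler frozen (probe only)
    vs. resampler and probe trained jointly."""
    for p in resampler.parameters():
        p.requires_grad = joint              # freeze unless training jointly
    params = list(probe.parameters())
    if joint:
        params += list(resampler.parameters())
    return torch.optim.AdamW(params, lr=1e-4)
```

Under this sketch, the paper's finding corresponds to the probe performing poorly when `make_optimizer(..., joint=False)` is used, and substantially better with `joint=True`, indicating the compressed representation *can* carry spatial information when the training objective demands it.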
DOI: 10.48550/arxiv.2404.13594