MaskInversion: Localized Embeddings via Optimization of Explainability Maps
Format: Article
Language: English
Abstract: Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and comparing its explainability map, derived from the foundation model, to the query mask. The embedding token is then iteratively refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, allowing MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, and localized captioning and image generation. We evaluate the proposed method on all of these tasks on several datasets, including PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7, and demonstrate its capabilities compared to other state-of-the-art approaches.
DOI: 10.48550/arxiv.2407.20034
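The abstract describes an iterative, test-time optimization: a single embedding token is refined so that its explainability map matches the query mask, while the backbone stays frozen. Below is a minimal PyTorch sketch of that loop. It substitutes a simple cosine-similarity-plus-sigmoid map over patch embeddings for the paper's model-derived explainability map, and it omits the gradient decomposition strategy; all names, shapes, and hyperparameters (`mask_inversion`, `patch_feats`, the 0.07 temperature) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mask_inversion(patch_feats, global_feat, query_mask, steps=200, lr=0.05):
    """Minimal MaskInversion-style sketch (assumptions, not the paper's code).

    patch_feats: (N, D) patch embeddings from a frozen CLIP-like encoder,
                 already detached (only the token below is optimized).
    global_feat: (D,) global image embedding used to initialize the token.
    query_mask:  (N,) binary mask over the patch grid, as floats in {0, 1}.
    """
    # Initialize the embedding token from the global image embedding;
    # it is the only parameter with gradients enabled.
    token = global_feat.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([token], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # Stand-in explainability map: how strongly each patch responds
        # to the token, squashed to (0, 1). The paper instead derives this
        # map from the foundation model and uses a gradient decomposition
        # to make its differentiation cheap.
        sim = F.cosine_similarity(patch_feats, token.unsqueeze(0), dim=-1)
        expl = torch.sigmoid(sim / 0.07)
        # Refine the token by minimizing the discrepancy between its
        # explainability map and the query mask.
        loss = F.binary_cross_entropy(expl, query_mask)
        loss.backward()
        optimizer.step()

    return token.detach()  # localized, context-aware region embedding

# Example usage with random stand-in features (D=512, 14x14 patch grid):
feats = torch.randn(196, 512)
glob = torch.randn(512)
mask = (torch.rand(196) > 0.8).float()
region_emb = mask_inversion(feats, glob, mask)
```

Because only the token receives gradients, the cost per step is a single forward and backward pass through the (frozen) map computation, which is what makes a per-query, test-time optimization of this kind tractable.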