Image-Caption Encoding for Improving Zero-Shot Generalization
Saved in:
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Recent advances in vision-language models have combined contrastive
approaches with generative methods to achieve state-of-the-art (SOTA)
performance on downstream inference tasks like zero-shot image classification.
However, a persistent issue with these models for image classification is
their limited out-of-distribution (OOD) generalization. We first show that
when an OOD data point is misclassified, the correct class can typically be
found among the Top-K predicted classes. To steer the model prediction toward
the correct class within the top predicted classes, we propose the
Image-Caption Encoding (ICE) method, a straightforward approach that directly
enforces consistency between the image-conditioned and caption-conditioned
predictions at evaluation time only. Intuitively, we take advantage of unique
properties of the generated captions to guide our local search for the correct
class label within the Top-K predicted classes. We show that our method can be
easily combined with other SOTA methods to improve Top-1 OOD accuracy by 0.5%
on average and up to 3% on challenging datasets. Our code:
https://github.com/Chris210634/ice |
---|---|
DOI: | 10.48550/arxiv.2402.02662 |
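
For a concrete picture of the evaluation-time rescoring the abstract describes, below is a minimal PyTorch sketch: the image-conditioned prediction is restricted to its Top-K candidates, which are then re-ranked by adding a caption-conditioned score. The function name `ice_rescore`, the weight `lam`, and the additive combination rule are illustrative assumptions, not necessarily the authors' exact formulation; see the linked repository for the reference implementation.

```python
import torch

def ice_rescore(image_logits: torch.Tensor,
                caption_logits: torch.Tensor,
                k: int = 10,
                lam: float = 0.5) -> int:
    """Re-rank the Top-K image-conditioned classes using the
    caption-conditioned logits, and return the winning class index.

    image_logits:   (num_classes,) similarities between the image
                    embedding and each class-text embedding.
    caption_logits: (num_classes,) similarities between the embedding
                    of a caption generated from the image and each
                    class-text embedding.
    lam:            hypothetical weight on the caption term.
    """
    # Restrict the search to the Top-K image-conditioned candidates.
    topk_scores, topk_idx = image_logits.topk(k)
    # Consistency term: favor candidates that both the image and its
    # generated caption point toward.
    combined = topk_scores + lam * caption_logits[topk_idx]
    return int(topk_idx[combined.argmax()])

# Toy usage with random scores standing in for CLIP-style similarities.
num_classes = 100
image_logits = torch.randn(num_classes)
caption_logits = torch.randn(num_classes)
pred = ice_rescore(image_logits, caption_logits, k=10)
print(f"Predicted class index: {pred}")
```

Because the rescoring only reorders the Top-K candidates at evaluation time, it composes naturally with other inference-time techniques, consistent with the abstract's claim that ICE can be combined with other SOTA methods.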