Enhancing Visual Classification using Comparative Descriptors
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: The performance of vision-language models (VLMs), such as CLIP, in visual classification tasks has been enhanced by leveraging semantic knowledge from large language models (LLMs), including GPT. Recent studies have shown that in zero-shot classification tasks, descriptors incorporating additional cues, high-level concepts, or even random characters often outperform those using only the category name. In many classification tasks, while the top-1 accuracy may be relatively low, the top-5 accuracy is often significantly higher. This gap implies that most misclassifications occur among a few similar classes, highlighting the model's difficulty in distinguishing between classes with subtle differences. To address this challenge, we introduce a novel concept of comparative descriptors. These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation. By generating and integrating these comparative descriptors into the classification framework, we refine the semantic focus and improve classification accuracy. An additional filtering process ensures that these descriptors are closer to the image embeddings in the CLIP space, further enhancing performance. Our approach demonstrates improved accuracy and robustness in visual classification tasks by addressing the specific challenge of subtle inter-class differences.
DOI: 10.48550/arxiv.2411.05357
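
The abstract describes the approach only at a high level. The sketch below is a minimal, hypothetical illustration of descriptor-based zero-shot classification with CLIP, including a simple similarity-based filtering step in the spirit of the one the abstract mentions. The class names, descriptor strings, `sim_threshold` value, the mean-over-descriptors scoring rule, and the file `dog.jpg` are illustrative assumptions, not details taken from the paper; in the paper's setting the comparative descriptors would be generated by an LLM that contrasts a target class against its most similar classes.

```python
# Hypothetical sketch of comparative-descriptor classification with CLIP.
# Assumes OpenAI's "clip" package (github.com/openai/CLIP) and PyTorch.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Per-class descriptor prompts. The comparative phrasing ("unlike a wolf ...")
# is an illustrative assumption, not the paper's exact prompt format.
descriptors = {
    "husky": [
        "a photo of a husky, which often has blue eyes unlike a wolf",
        "a photo of a husky, a domestic dog with a curled, bushy tail",
    ],
    "wolf": [
        "a photo of a wolf, which has longer legs and a straighter tail than a husky",
        "a photo of a wolf, a wild canine with a broad muzzle",
    ],
}

@torch.no_grad()
def classify(image_path: str, sim_threshold: float = 0.2) -> str:
    # Encode and normalize the image.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for cls, texts in descriptors.items():
        # Encode and normalize this class's descriptors.
        tokens = clip.tokenize(texts).to(device)
        txt_feat = model.encode_text(tokens)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

        # Cosine similarity between the image and each descriptor.
        sims = (img_feat @ txt_feat.T).squeeze(0)

        # Filtering step (assumed form): keep only descriptors whose
        # embedding lies close enough to the image embedding in CLIP space.
        kept = sims[sims > sim_threshold]
        class_score = kept.mean() if kept.numel() > 0 else sims.mean()

        # Class score = mean similarity over its (filtered) descriptors.
        scores[cls] = class_score.item()

    return max(scores, key=scores.get)

print(classify("dog.jpg"))
```

Averaging descriptor similarities per class and then taking the arg-max mirrors common descriptor-based zero-shot pipelines; the comparative wording and the filtering threshold are the parts this sketch only approximates.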