Improving CLIP Robustness with Knowledge Distillation and Self-Training
Format: Article
Language: English
Abstract: This paper examines the robustness of a multi-modal computer vision model, CLIP (Contrastive Language-Image Pretraining), in the context of unsupervised learning. The objective is twofold: first, to evaluate the robustness of CLIP, and second, to explore strategies for improving it. To this end, we introduce a novel approach named LP-CLIP. This technique distills CLIP features through a linear probing layer placed on top of its encoding structure. The newly added layer is trained using pseudo-labels produced by CLIP, coupled with a self-training strategy. LP-CLIP offers a promising way to enhance the robustness of CLIP without the need for annotations. By leveraging a simple linear probing layer, we aim to improve the model's ability to withstand the various uncertainties and challenges commonly encountered in real-world scenarios. Importantly, our approach does not rely on annotated data, which makes it particularly valuable in situations where labeled data are scarce or costly to obtain. Our proposed approach increases the robustness of CLIP, achieving SOTA results compared to supervised techniques on various datasets.
DOI: 10.48550/arxiv.2309.10361
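The abstract describes LP-CLIP as a linear probing layer trained on top of a frozen CLIP encoder using CLIP-generated pseudo-labels and self-training. The sketch below shows what one such training step could look like, assuming PyTorch and OpenAI's `clip` package; the class prompts, confidence threshold, and optimizer settings are illustrative placeholders, not details taken from the paper.

```python
# Minimal sketch of an LP-CLIP-style training step (illustrative only).
# Assumes PyTorch and OpenAI's `clip` package; hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # CLIP stays frozen; only the linear probe below is trained

# Hypothetical label set; the paper evaluates on standard classification datasets.
class_names = ["airplane", "automobile", "bird", "cat", "dog"]
text_tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_features = F.normalize(model.encode_text(text_tokens), dim=-1)

# Linear probing layer on top of CLIP's frozen image encoder.
probe = nn.Linear(model.visual.output_dim, len(class_names)).to(device)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def lp_clip_step(images, confidence_threshold=0.9):
    """One self-training step: pseudo-label a batch with CLIP's zero-shot
    predictions, then train the linear probe on the confident samples."""
    with torch.no_grad():
        image_features = F.normalize(model.encode_image(images), dim=-1)
        # Zero-shot image-text similarity provides the pseudo-labels.
        zero_shot_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
        confidence, pseudo_labels = zero_shot_probs.max(dim=-1)
        keep = confidence > confidence_threshold  # keep only confident pseudo-labels

    if keep.sum() == 0:
        return None
    logits = probe(image_features[keep].float())
    loss = F.cross_entropy(logits, pseudo_labels[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The confidence threshold used here is just one common self-training heuristic for filtering noisy pseudo-labels; the paper's actual training procedure may differ.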