PRO-CLIP: A CLIP-Based Category Measurement Network Through Prototype and Regularized Optimal Transportation
Published in: | IEEE Transactions on Instrumentation and Measurement, 2024, Vol. 73, p. 1-18 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | In unstructured environments, robots are likely to encounter desktop objects that they have never seen before. Classifying these objects precisely is a prerequisite for accomplishing object-specific manipulation tasks. However, collecting large-scale object classification datasets is time-consuming. Inspired by prompt tuning methods, we propose the PRO-CLIP network, a category measurement method for desktop objects. Specifically, PRO-CLIP performs few-shot classification based on knowledge from a pretrained vision-language model (VLM). It utilizes token-level and prompt-level optimal transportation (OT) to jointly fine-tune the learnable vision-language prompts. For the token-level stage, we propose the image patch reweighting (PR) mechanism to make the alignments focus on the image patches that are close to the patch prototypes. This allows the patch embeddings to converge toward category representations, which reduces intraclass differences of visual features. For the prompt-level stage, we propose a cascading OT (COT) module that simultaneously considers task-irrelevant knowledge in zero-shot features and task-specific knowledge in few-shot features. Owing to the generalization performance of the task-irrelevant knowledge, the proposed module achieves feature regularization during OT. Finally, we propose the UP loss to supervise the whole network; it combines unbalanced logit-level consistency losses with a visual prototype loss. The consistency losses keep the learnable features close to the zero-shot features, and the prototype loss pulls the visual features toward their corresponding prototypes in distance. We demonstrate the effectiveness of our method through few-shot classification experiments on several datasets, including desktop objects. The relevant code will be available at https://github.com/NeuCV-IRMI/proclip . |
---|---|
ISSN: | 0018-9456 1557-9662 |
DOI: | 10.1109/TIM.2024.3485403 |
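The token-level and prompt-level alignments described in the abstract rest on entropic-regularized optimal transport. As a rough illustration of that building block (not the paper's actual implementation; the function names, toy data, and uniform weights below are all assumptions), a minimal Sinkhorn-Knopp sketch aligning a few "patch embeddings" with "prototypes" under a cosine-distance cost looks like:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan between marginals a and b
    for a given cost matrix, via Sinkhorn-Knopp iterations."""
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan

# Toy example: align 4 hypothetical patch embeddings with 3 prototypes.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
protos = rng.normal(size=(3, 8))

# Cosine-distance cost, as is common for CLIP-style features.
pn = patches / np.linalg.norm(patches, axis=1, keepdims=True)
qn = protos / np.linalg.norm(protos, axis=1, keepdims=True)
cost = 1.0 - pn @ qn.T

# Uniform marginals; a mechanism like PR would instead reweight `a`
# so that patches close to the prototypes receive more mass.
a = np.full(4, 1 / 4)
b = np.full(3, 1 / 3)
plan = sinkhorn(cost, a, b)
```

The resulting `plan` is a soft assignment of patches to prototypes whose row and column sums recover the prescribed marginals, which is what makes OT usable as a differentiable alignment inside a fine-tuning objective.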