PRO-CLIP: A CLIP-Based Category Measurement Network Through Prototype and Regularized Optimal Transportation

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE Transactions on Instrumentation and Measurement, 2024, Vol. 73, pp. 1-18
Main authors: Cao, He; Zhang, Yunzhou; Zhu, Shangdong; Wang, Lei
Format: Article
Language: English
Subjects:
Online access: Order full text
Description
Summary: In unstructured environments, robots are likely to encounter desktop objects they have never seen before. Classifying these objects precisely is a prerequisite for accomplishing object-specific manipulation tasks. However, collecting large-scale object classification datasets is time-consuming. Inspired by prompt tuning methods, we propose the PRO-CLIP network, a category measurement method for desktop objects. Specifically, PRO-CLIP performs few-shot classification based on knowledge from a pretrained vision-language model (VLM). It uses token-level and prompt-level optimal transportation (OT) to jointly fine-tune the learnable vision-language prompts. In the token-level stage, we propose an image patch reweighting (PR) mechanism that makes the alignments focus on the image patches closest to the patch prototypes. This lets the patch embeddings form converging category representations, reducing intraclass differences in the visual features. In the prompt-level stage, we propose a cascading OT (COT) module that simultaneously considers task-irrelevant knowledge in zero-shot features and task-specific knowledge in few-shot features. Because task-irrelevant knowledge generalizes well, the proposed module achieves feature regularization during OT. Finally, we propose the UP loss to supervise the whole network; it combines unbalanced logit-level consistency losses with a visual prototype loss. The logit-level consistency losses keep the learnable features close to the zero-shot features, and the prototype loss pulls the visual features toward their corresponding prototypes in distance. We demonstrate the effectiveness of our method through few-shot classification experiments on several datasets, including desktop objects. The relevant code will be available at https://github.com/NeuCV-IRMI/proclip .
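The abstract does not give the exact formulation of the token-level alignment, so the following is only a minimal illustrative sketch of the general idea: entropic-regularized OT (Sinkhorn) between image patch embeddings and patch prototypes, where the patch marginal is reweighted by each patch's similarity to the prototypes. All function names, the Sinkhorn variant, and the softmax-style reweighting are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Entropic-regularized OT (Sinkhorn iterations).

    a: source marginal over patches, b: target marginal over prototypes,
    C: cost matrix (e.g., cosine distance). Returns the transport plan P
    whose row sums approximate a and column sums approximate b.
    """
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)                 # scale columns toward b
        u = a / (K @ v)                   # scale rows toward a
    return u[:, None] * K * v[None, :]

def patch_reweighting(patches, prototypes, tau=0.1):
    """Hypothetical PR step: weight each patch by its best cosine
    similarity to any prototype (softmax-normalized), so the OT
    alignment concentrates on prototype-like patches."""
    P = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    Q = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    scores = (P @ Q.T).max(axis=1) / tau  # (n_patches,)
    w = np.exp(scores - scores.max())     # numerically stable softmax
    return w / w.sum()

# Toy example with random embeddings (6 patches, 3 prototypes, dim 4).
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 4))
prototypes = rng.normal(size=(3, 4))

a = patch_reweighting(patches, prototypes)   # reweighted patch marginal
b = np.full(3, 1 / 3)                        # uniform prototype marginal
Pn = patches / np.linalg.norm(patches, axis=1, keepdims=True)
Qn = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
C = 1.0 - Pn @ Qn.T                          # cosine-distance cost
plan = sinkhorn(a, b, C)                     # (6, 3) transport plan
```

Patches with higher prototype similarity receive more mass in `a`, so the resulting plan routes more transport through them, which is one plausible reading of "making alignments focus on the image patches close to the patch prototypes."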
ISSN: 0018-9456; 1557-9662
DOI: 10.1109/TIM.2024.3485403