Pseudo-triplet Guided Few-shot Composed Image Retrieval
Format: | Article |
Language: | English |
Online access: | Order full text |
Abstract: | Composed Image Retrieval (CIR) is a challenging task that aims to retrieve
the target image with a multimodal query, i.e., a reference image and its
complementary modification text. As previous supervised and zero-shot learning
paradigms both fail to strike a good trade-off between the model's
generalization ability and retrieval performance, recent researchers have
introduced the task of few-shot CIR (FS-CIR) and proposed a textual
inversion-based network built on the pretrained CLIP model to realize it. Despite
its promising performance, this approach has two key limitations: it relies
solely on the few annotated samples for CIR model training, and it selects
training triplets indiscriminately for CIR model fine-tuning. To address these
two limitations, we propose a novel two-stage pseudo-triplet guided few-shot
CIR scheme, dubbed PTG-FSCIR. In the first stage, we propose an attentive
masking and captioning-based pseudo-triplet generation method that constructs
pseudo triplets from pure image data and uses them for CIR task-specific
pretraining. In the second stage, we propose a challenging triplet-based CIR
fine-tuning method, in which we design a pseudo modification text-based
challenging score estimation strategy and a robust top range-based random
sampling strategy to select challenging triplets that promote model
fine-tuning. Notably, our scheme is plug-and-play and compatible with any
existing supervised CIR model. We test our scheme across two backbones on
three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving
maximum improvements of 13.3%, 22.2%, and 17.4%, respectively, demonstrating
our scheme's efficacy. |
DOI: | 10.48550/arxiv.2407.06001 |
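The first stage described in the abstract builds pseudo triplets from unlabeled images via attentive masking and captioning. The sketch below is one plausible reading of that step: the most-attended patches of an image are masked out to form a pseudo reference, a caption supplies the pseudo modification text, and the original image serves as the target. The `attention_fn` and `caption_fn` callables, the patch-based masking, and this role assignment are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of attentive-masking-and-captioning pseudo-triplet generation,
# assuming an external attention model and an external image captioner.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np


@dataclass
class PseudoTriplet:
    reference: np.ndarray      # masked image, playing the role of the reference
    modification_text: str     # caption used as the pseudo modification text
    target: np.ndarray         # the original, unmasked image


def attentive_mask(image: np.ndarray,
                   patch_attention: np.ndarray,
                   patch_size: int,
                   mask_ratio: float = 0.3) -> np.ndarray:
    """Zero out the `mask_ratio` fraction of patches with the highest attention."""
    h_patches, w_patches = patch_attention.shape
    n_masked = max(1, int(mask_ratio * h_patches * w_patches))
    # Indices of the most-attended patches (flattened, descending order).
    top = np.argsort(patch_attention.ravel())[::-1][:n_masked]
    masked = image.copy()
    for idx in top:
        r, c = divmod(int(idx), w_patches)
        masked[r * patch_size:(r + 1) * patch_size,
               c * patch_size:(c + 1) * patch_size] = 0
    return masked


def build_pseudo_triplets(images: List[np.ndarray],
                          attention_fn: Callable[[np.ndarray], np.ndarray],
                          caption_fn: Callable[[np.ndarray], str],
                          patch_size: int = 16) -> List[PseudoTriplet]:
    """Construct (reference, modification text, target) triplets from pure image data."""
    triplets = []
    for img in images:
        attn = attention_fn(img)                      # per-patch attention scores
        ref = attentive_mask(img, attn, patch_size)   # masked image -> pseudo reference
        text = caption_fn(img)                        # caption -> pseudo modification text
        triplets.append(PseudoTriplet(ref, text, img))
    return triplets
```

Triplets produced this way would then feed the CIR task-specific pretraining stage in place of annotated data.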
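The second stage fine-tunes on challenging triplets chosen via a challenging score and a top range-based random sampling strategy. The sketch below is a hedged interpretation: each candidate is scored by how poorly the pretrained model matches the composed query to its target, only the top range of the ranking is kept, and the fine-tuning set is drawn uniformly at random from that range. The scoring rule, `similarity_fn`, `top_fraction`, and the cutoff logic are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of challenging-score estimation and top range-based random sampling.
import random
from typing import Callable, List, Sequence, Tuple

Triplet = Tuple[object, str, object]  # (reference image, modification text, target image)


def challenging_score(triplet: Triplet,
                      similarity_fn: Callable[[object, str, object], float]) -> float:
    """Higher score = the pretrained model finds this triplet harder to retrieve."""
    ref, text, tgt = triplet
    return 1.0 - similarity_fn(ref, text, tgt)


def top_range_random_sample(candidates: Sequence[Triplet],
                            similarity_fn: Callable[[object, str, object], float],
                            k: int,
                            top_fraction: float = 0.2,
                            seed: int = 0) -> List[Triplet]:
    """Rank candidates by challenging score, keep the top `top_fraction` of the
    ranking, then sample `k` triplets uniformly at random from that range."""
    ranked = sorted(candidates,
                    key=lambda t: challenging_score(t, similarity_fn),
                    reverse=True)
    cutoff = max(k, int(len(ranked) * top_fraction))
    top_range = ranked[:cutoff]
    rng = random.Random(seed)
    return rng.sample(top_range, min(k, len(top_range)))
```

Sampling randomly within a top range, rather than taking the single hardest triplets, is what makes the selection robust to noisy pseudo modification texts in this reading of the abstract.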