CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions
Format: Article
Language: English
Online access: Order full text
Abstract: Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining such as CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, observing a strong inverse effect in learning with synthetic captions (short synthetic captions generally lead to much higher performance than full-length ones), we feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner that mimics the recaptioning process: conditioned on the paired image input and the web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance on cross-modal retrieval tasks, setting new SOTA results on MSCOCO and Flickr30K. Moreover, the resulting vision encoders can enhance the visual capability of LLaVA, showing strong improvements on a range of MLLM benchmarks. Our project page is https://ucsc-vlaa.github.io/CLIPS/.
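The abstract describes two training-time designs: truncating the synthetic caption before it reaches the text encoder, and an auxiliary autoregressive captioner that reconstructs the full synthetic caption from the image and the web-crawled text. The sketch below is a minimal, hypothetical PyTorch rendering of how such a training step could be wired up, not the released implementation; the modules `image_encoder`, `text_encoder`, and `captioner`, and parameters such as `keep_len` and the loss weight, are assumptions for illustration.

```python
# Hypothetical sketch of a CLIPS-style training step (assumed module names and shapes).
import torch
import torch.nn.functional as F


def truncate_captions(synthetic_tokens: torch.Tensor, keep_len: int) -> torch.Tensor:
    """Design 1: keep only a prefix of each tokenized synthetic caption."""
    return synthetic_tokens[:, :keep_len]


def training_step(image_encoder, text_encoder, captioner,
                  images, web_tokens, synthetic_tokens,
                  keep_len: int = 32, temperature: float = 0.07,
                  caption_loss_weight: float = 1.0) -> torch.Tensor:
    # Contrastive branch: match images against the *truncated* synthetic captions.
    img_feats = image_encoder(images)                       # [B, D]
    img_emb = F.normalize(img_feats, dim=-1)
    txt_emb = F.normalize(
        text_encoder(truncate_captions(synthetic_tokens, keep_len)), dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # [B, B]
    targets = torch.arange(images.size(0), device=logits.device)
    contrastive_loss = (F.cross_entropy(logits, targets) +
                        F.cross_entropy(logits.t(), targets)) / 2

    # Design 2: captioning branch. The captioner (assumed interface) conditions on
    # image features and the web-crawled caption, and predicts the next token of
    # the full-length synthetic caption; returns logits of shape [B, L-1, vocab].
    caption_logits = captioner(image_features=img_feats,
                               web_tokens=web_tokens,
                               target_tokens=synthetic_tokens[:, :-1])
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        synthetic_tokens[:, 1:].reshape(-1))

    return contrastive_loss + caption_loss_weight * caption_loss
```

In this reading, the contrastive objective only ever sees a short prefix of the synthetic caption, while the captioning objective still forces the model to account for the full-length description; the relative weight of the two losses is left as a tunable assumption.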
DOI: 10.48550/arxiv.2411.16828