UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks
Format: Article
Language: English
Abstract: Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated great performance in many vision-language tasks. However, image-text pairs for pre-training are restricted to the intersection of images and texts, limiting their ability to cover a large distribution of real-world data; noise can also be introduced as misaligned pairs during pre-processing. Conversely, unimodal models trained on text or image data alone through unsupervised techniques can achieve broader coverage of diverse real-world data and are not constrained by the requirement that image and text be present simultaneously. In this paper, we demonstrate that using large-scale unsupervised unimodal models for pre-training can enhance the zero-shot performance of image-text pair models. Our thorough studies validate that models pre-trained in this way learn rich representations of both modalities, improving their ability to understand how images and text relate to each other. Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models by 6.5% (52.3% $\rightarrow$ 58.8%) on PASCAL-5$^i$ and by 6.2% (27.2% $\rightarrow$ 33.4%) on COCO-20$^i$ semantic segmentation under the zero-shot setting, respectively. By learning representations of both modalities, unimodal pre-training offers broader coverage, reduced misalignment errors, and the ability to capture more complex features and patterns in real-world data, resulting in better performance, especially on zero-shot vision-language tasks.
DOI: 10.48550/arxiv.2306.04715
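The abstract's core idea can be illustrated with a minimal sketch: initialize the image and text towers of a two-tower model from unsupervised unimodal checkpoints (e.g., a self-supervised vision encoder and a masked-language-model text encoder) rather than from scratch, then align them with a CLIP-style contrastive loss on image-text pairs. This is not the authors' implementation; the encoder classes, checkpoint paths, projection heads, and temperature initialization below are hypothetical placeholders.

```python
# Sketch only: unimodal-pre-trained towers aligned with a CLIP-style objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder          # e.g., a ViT pre-trained with a self-supervised objective
        self.text_encoder = text_encoder            # e.g., a Transformer pre-trained with masked language modeling
        self.image_proj = nn.LazyLinear(embed_dim)  # project each tower into a shared embedding space
        self.text_proj = nn.LazyLinear(embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature, init ~ln(1/0.07)

    def forward(self, images, token_ids):
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(token_ids)), dim=-1)
        return img, txt, self.logit_scale.exp()

def clip_contrastive_loss(img, txt, scale):
    """Symmetric InfoNCE over the in-batch image-text similarity matrix."""
    logits = scale * img @ txt.t()
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Hypothetical usage: both towers start from unimodal, unsupervised checkpoints
# instead of random initialization, then are aligned on paired image-text data.
# image_encoder.load_state_dict(torch.load("ssl_vision_checkpoint.pt"))
# text_encoder.load_state_dict(torch.load("mlm_text_checkpoint.pt"))
```

The intended benefit, as the abstract argues, is that each tower already encodes a broad unimodal data distribution before paired alignment, so the contrastive stage only has to learn the cross-modal mapping rather than the representations themselves.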