CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Data selection has emerged as a core issue for large-scale visual-language model pretaining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding model...
Gespeichert in:
Hauptverfasser: | , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Data selection has emerged as a core issue for large-scale visual-language
model pretaining (e.g., CLIP), particularly with noisy web-curated datasets.
Three main data selection approaches are: (1) leveraging external non-CLIP
models to aid data selection, (2) training new CLIP-style embedding models that
are more effective at selecting high-quality data than the original OpenAI CLIP
model, and (3) designing better metrics or strategies universally applicable to
any CLIP embedding without requiring specific model properties (e.g., CLIPScore
is one popular metric). While the first two approaches have been extensively
studied, the third remains under-explored. In this paper, we advance the third
approach by proposing two new methods. Firstly, instead of classical CLIP
scores that only consider the alignment between two modalities from a single
sample, we introduce negCLIPLoss, a CLIP loss-inspired method that adds the
alignment between one sample and its contrastive pairs as an extra
normalization term for better quality measurement. Secondly, when downstream
tasks are known, we propose a new norm-based metric, NormSim, to measure the
similarity between pretraining data and target data. We test our methods on the
data selection benchmark, DataComp~\cite{gadre2023datacomp}. Compared to the
best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3\%
improvement on ImageNet-1k and a 2.8\% improvement on 38 downstream evaluation
tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing
techniques. By combining our methods with the current best methods
DFN~\cite{fang2023data} and HYPE~\cite{kim2024hype}, we can boost average
performance on downstream tasks by 0.9\%, achieving a new state-of-the-art. |
---|---|
DOI: | 10.48550/arxiv.2405.19547 |