CREPE: Can Vision-Language Foundation Models Reason Compositionally?
Saved in:

| Main authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Order full text |
Abstract:

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that, across 7 architectures trained with 4 algorithms on massive datasets, these models struggle with compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified in the cognitive science literature: systematicity and productivity. To measure systematicity, CREPE contains a test dataset of over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs spanning nine complexity levels, plus 183K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 12%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size.
DOI: 10.48550/arxiv.2212.07796
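
The abstract reports image-to-text Recall@1 over retrieval sets that mix each image's true caption with hard negative captions. The sketch below is a rough illustration of how such a metric is typically computed from precomputed CLIP-style embeddings; it is not the paper's released evaluation code, and the function name, array shapes, and usage example are hypothetical.

```python
# Illustrative Recall@1 with hard negatives, assuming precomputed,
# L2-normalized embeddings from any CLIP-style model.
# This is a reconstruction for illustration, not the CREPE evaluation code.
import numpy as np

def recall_at_1(image_embs: np.ndarray,
                positive_caption_embs: np.ndarray,
                hard_negative_embs: np.ndarray) -> float:
    """image_embs: (N, D) one embedding per image.
    positive_caption_embs: (N, D) the ground-truth caption per image.
    hard_negative_embs: (N, K, D) K hard-negative captions per image.
    Returns the fraction of images whose true caption is ranked first."""
    # Cosine similarity of each image with its own positive caption: (N,)
    pos_sim = np.sum(image_embs * positive_caption_embs, axis=1)
    # Cosine similarity of each image with each of its K negatives: (N, K)
    neg_sim = np.einsum("nd,nkd->nk", image_embs, hard_negative_embs)
    # A hit means the positive caption outscores every hard negative.
    hits = pos_sim > neg_sim.max(axis=1)
    return float(hits.mean())

# Hypothetical usage with random vectors standing in for model outputs:
rng = np.random.default_rng(0)
imgs = rng.normal(size=(100, 512))
pos = rng.normal(size=(100, 512))
negs = rng.normal(size=(100, 5, 512))
# Normalize so dot products behave as cosine similarities.
imgs /= np.linalg.norm(imgs, axis=-1, keepdims=True)
pos /= np.linalg.norm(pos, axis=-1, keepdims=True)
negs /= np.linalg.norm(negs, axis=-1, keepdims=True)
print(f"Recall@1 (random baseline, ~1/6): {recall_at_1(imgs, pos, negs):.3f}")
```

With K hard negatives per image, a model scoring at chance achieves Recall@1 of roughly 1/(K+1), which is the "random chance" reference point the abstract alludes to at high caption complexity.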