Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
Saved in:
Main authors: | , , , , , , |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
Abstract: | Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning, leading to state-of-the-art models for various downstream multimodal tasks. However, recent research has highlighted severe limitations of these models in their ability to perform compositional reasoning over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. These are graph-structured semantic representations of images that contain objects, their attributes, and relations with other objects in a scene. In this work, we consider the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework along with a coarse-to-fine contrastive learning objective between images and text that aligns sentences of various complexities to the same image. Along with this, we propose novel negative mining techniques in the scene graph space for improving attribute binding and relation understanding. Through extensive experiments, we demonstrate the effectiveness of our approach, which significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (for example, improvements of up to $18\%$ for systematic generalization and $16.5\%$ for relation understanding over a strong baseline), while achieving similar or better performance than CLIP on various general multimodal tasks. |
DOI: | 10.48550/arxiv.2305.13812 |
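The coarse-to-fine objective described in the abstract pairs a single image with captions of several granularities derived from the text scene graph (for example, the full caption, relation phrases, and attribute-object phrases). Below is a minimal sketch of how such an objective could look, assuming a CLIP-style dual encoder in PyTorch; the encoder interfaces, the decomposition levels, and all function names are illustrative assumptions rather than the authors' released code.

```python
# Illustrative sketch (not the paper's implementation): coarse-to-fine
# image-text contrastive loss over captions of increasing complexity.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss between L2-normalized image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def coarse_to_fine_loss(images, captions_per_level, image_encoder, text_encoder,
                        level_weights=None):
    """Align the same image with captions at several complexity levels.

    captions_per_level: list over levels; each entry is a batch of captions,
    e.g. level 0 = full sentence, level 1 = subject-relation-object phrases,
    level 2 = attribute-object phrases (all hypothetical decomposition levels).
    """
    img_emb = image_encoder(images)
    losses = [contrastive_loss(img_emb, text_encoder(captions))
              for captions in captions_per_level]
    if level_weights is None:
        level_weights = [1.0] * len(losses)
    return sum(w * l for w, l in zip(level_weights, losses)) / sum(level_weights)
```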
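The negative mining in scene-graph space can similarly be pictured as perturbing parsed (object, attribute) pairs and (subject, predicate, object) triples before re-linearizing them into hard-negative captions that stress attribute binding and relation understanding. The tuple formats and helper names below are assumptions chosen for illustration, not the paper's exact procedure.

```python
# Illustrative sketch (not the paper's implementation): hard negatives mined
# by editing the text scene graph rather than the raw caption string.
import random

def swap_attribute_negative(attributes):
    """attributes: list of (object, attribute) pairs.
    Example: [("car", "red"), ("dog", "small")] -> [("car", "small"), ("dog", "red")]
    Exchanging attributes between two objects targets attribute binding."""
    if len(attributes) < 2:
        return None
    negs = list(attributes)
    i, j = random.sample(range(len(negs)), 2)
    negs[i], negs[j] = (negs[i][0], negs[j][1]), (negs[j][0], negs[i][1])
    return negs

def swap_relation_negative(relations):
    """relations: list of (subject, predicate, object) triples.
    Example: [("man", "riding", "horse")] -> [("horse", "riding", "man")]
    Reversing subject and object targets relation understanding."""
    if not relations:
        return None
    negs = list(relations)
    k = random.randrange(len(negs))
    s, p, o = negs[k]
    negs[k] = (o, p, s)
    return negs

def to_caption(attributes, relations):
    """Linearize a (possibly perturbed) scene graph into a simple caption."""
    phrases = [f"{attr} {obj}" for obj, attr in attributes]
    phrases += [f"{s} {p} {o}" for s, p, o in relations]
    return ", ".join(phrases)
```

In training, such linearized negatives would be scored by the text encoder alongside the true caption so that the contrastive objective explicitly penalizes confusing swapped attributes or reversed relations.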