Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency
| Main authors: | |
|---|---|
| Format: | Article |
| Language: | eng |
| Subjects: | |
| Online access: | Order full text |
Abstract:

Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities. However, automatically collecting such data (e.g., via large-scale web scraping) leads to low quality and poor image-text correlation, while human annotation is more accurate but requires significant manual effort and expense. We introduce ITIT (InTegrating Image Text): an innovative training paradigm grounded in the concept of cycle consistency that allows vision-language training on unpaired image and text data. ITIT comprises a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework. During training, ITIT leverages a small set of paired image-text data to ensure its output matches the input reasonably well in both directions. Simultaneously, the model is also trained on much larger datasets containing only images or only texts. This is achieved by enforcing cycle consistency between the original unpaired samples and their cycle-generated counterparts. For instance, ITIT generates a caption for a given input image, then uses that caption to create an output image, and enforces similarity between the input and output images. Our experiments show that ITIT trained with unpaired datasets exhibits scaling behavior similar to that of training on high-quality paired data. We demonstrate image generation and captioning performance on par with state-of-the-art text-to-image and image-to-text models while using orders of magnitude fewer paired image-text examples (only 3M).
DOI: 10.48550/arxiv.2310.03734
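
The following is a minimal, self-contained sketch of the cycle-consistency idea described in the abstract: for unpaired data, one modality is mapped to the other and back, and a reconstruction loss ties the cycle output to the original input, while a small paired set supervises both directions directly. This is not the authors' implementation; the class and function names (`ToyVisionLanguageModel`, `cycle_losses`, `paired_losses`), the toy linear layers, the MSE losses, and the loss weighting are all illustrative assumptions, and real caption generation is discrete, which ITIT handles differently than this continuous stand-in does.

```python
# Illustrative sketch only: toy linear layers stand in for the real joint
# image-text encoder and disjoint image/text decoders described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionLanguageModel(nn.Module):
    """Hypothetical stand-in: shared latent space with separate image/text decoders."""
    def __init__(self, img_dim=64, txt_dim=32, hidden=128):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, hidden)   # encoder, image branch
        self.txt_enc = nn.Linear(txt_dim, hidden)   # encoder, text branch
        self.img_dec = nn.Linear(hidden, img_dim)   # disjoint image decoder
        self.txt_dec = nn.Linear(hidden, txt_dim)   # disjoint text decoder

    def image_to_text(self, img):
        return self.txt_dec(torch.relu(self.img_enc(img)))

    def text_to_image(self, txt):
        return self.img_dec(torch.relu(self.txt_enc(txt)))

def cycle_losses(model, unpaired_img, unpaired_txt):
    # Image cycle: image -> generated caption -> reconstructed image.
    caption = model.image_to_text(unpaired_img)
    img_recon = model.text_to_image(caption)
    loss_img_cycle = F.mse_loss(img_recon, unpaired_img)

    # Text cycle: text -> generated image -> reconstructed text.
    image = model.text_to_image(unpaired_txt)
    txt_recon = model.image_to_text(image)
    loss_txt_cycle = F.mse_loss(txt_recon, unpaired_txt)
    return loss_img_cycle + loss_txt_cycle

def paired_losses(model, img, txt):
    # Small paired set: supervise image-to-text and text-to-image directly.
    loss_i2t = F.mse_loss(model.image_to_text(img), txt)
    loss_t2i = F.mse_loss(model.text_to_image(txt), img)
    return loss_i2t + loss_t2i

model = ToyVisionLanguageModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    # Random tensors stand in for a small paired batch and large unpaired pools.
    paired_img, paired_txt = torch.randn(8, 64), torch.randn(8, 32)
    unpaired_img, unpaired_txt = torch.randn(8, 64), torch.randn(8, 32)
    loss = paired_losses(model, paired_img, paired_txt) \
         + cycle_losses(model, unpaired_img, unpaired_txt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In this sketch the paired and cycle terms are simply summed with equal weight; in practice the balance between the two objectives, and how gradients flow through the generated intermediate modality, are design choices the paper itself addresses.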