Jointly Training Large Autoregressive Multimodal Models
| Main authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | eng |
Abstract: In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
DOI: 10.48550/arxiv.2309.15564
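The abstract describes JAM as fusing existing text and image generation backbones but does not spell out the fusion mechanism here. As a hedged illustration only, the sketch below shows one simple way two autoregressive decoders with identical architectures could be combined: element-wise parameter interpolation of their checkpoints. The function name, the `alpha` weight, and the model names in the usage comment are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch, assuming two pretrained decoders with identical
# architectures (same parameter names and shapes). This is NOT the
# paper's method, just one illustrative way to fuse two checkpoints.
import torch


def interpolate_state_dicts(text_model: torch.nn.Module,
                            image_model: torch.nn.Module,
                            alpha: float = 0.5) -> dict:
    """Return a state dict interpolating between two same-shaped models."""
    text_sd = text_model.state_dict()
    image_sd = image_model.state_dict()
    fused = {}
    for name, p_text in text_sd.items():
        p_image = image_sd[name]
        assert p_text.shape == p_image.shape, f"shape mismatch at {name}"
        if p_text.is_floating_point():
            # Blend floating-point weights element-wise.
            fused[name] = alpha * p_text + (1.0 - alpha) * p_image
        else:
            # Copy integer buffers (e.g. index tensors) unchanged.
            fused[name] = p_text.clone()
    return fused


# Hypothetical usage: load the fused weights into a fresh copy of either
# backbone, then continue joint autoregressive training on mixed-modal data.
# fused_model.load_state_dict(interpolate_state_dicts(text_lm, image_lm))
```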