MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
Saved in:
Main authors: Zheng, Kaizhi; He, Xuehai; Wang, Xin Eric
Format: Article
Language: English
Online access: Order full text
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated a profound capability for multimodal understanding. However, the simultaneous generation of images with coherent text remains underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method centered on the concept of "generative vokens". These vokens serve as pivotal elements for producing coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not require extensive image descriptions. We integrate classifier-free guidance to enhance the alignment of generated images and text, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over baseline models on multimodal generation datasets, including MMDialog and VIST. Human evaluation shows MiniGPT-5 is preferred over the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.
DOI: 10.48550/arxiv.2310.02239
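Since the record above carries only the abstract, the following is a minimal PyTorch sketch of the two mechanisms the abstract names: mapping the hidden states of "generative vokens" (special tokens appended to the LLM vocabulary) into the conditioning space of a text-to-image diffusion model, and the standard classifier-free guidance combination. `VokenMapper`, its layer structure, and all dimensions are assumptions chosen for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class VokenMapper(nn.Module):
    """Illustrative feature mapper: projects the LLM's hidden states at
    generative-voken positions into a diffusion model's conditioning
    space. Class name, layer sizes, and default dimensions are
    assumptions for this sketch, not the paper's exact design."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, voken_hidden: torch.Tensor) -> torch.Tensor:
        # voken_hidden: (batch, num_vokens, llm_dim)
        # returns:      (batch, num_vokens, cond_dim)
        return self.proj(voken_hidden)


def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             scale: float = 7.5) -> torch.Tensor:
    """Standard classifier-free guidance: extrapolate from the
    unconditional denoiser output toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    batch, num_vokens, llm_dim, cond_dim = 2, 8, 4096, 768
    mapper = VokenMapper(llm_dim, cond_dim)

    # Stand-in for the LLM's hidden states at the voken positions.
    voken_hidden = torch.randn(batch, num_vokens, llm_dim)
    cond = mapper(voken_hidden)
    print(cond.shape)  # torch.Size([2, 8, 768])

    # Stand-ins for denoiser outputs with and without conditioning.
    eps_cond = torch.randn(batch, 4, 64, 64)
    eps_uncond = torch.randn(batch, 4, 64, 64)
    eps = classifier_free_guidance(eps_cond, eps_uncond, scale=7.5)
    print(eps.shape)  # torch.Size([2, 4, 64, 64])
```

In the standard classifier-free guidance formulation, a guidance scale above 1 trades sample diversity for stronger agreement with the conditioning signal, which is consistent with the abstract's stated goal of tighter image-text alignment.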