Unified Discrete Diffusion for Simultaneous Vision-Language Generation
| Field | Value |
|---|---|
| Main authors | , , , , , , , |
| Format | Article |
| Language | eng |
| Subjects | |
| Online access | Order full text |
| Abstract | The recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on various generation tasks. |
| DOI | 10.48550/arxiv.2211.14842 |
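
The abstract's central technical idea is a single transition matrix that drives the discrete diffusion forward process over a joint text-plus-image token vocabulary. As a rough illustration only, the sketch below builds a mask-and-replace transition matrix in the spirit of D3PM/VQ-Diffusion over a joint vocabulary of text codes, image codes, and a shared [MASK] token; the function name, parameters, and block structure are assumptions for illustration, not the paper's actual construction.

```python
import numpy as np

def unified_transition_matrix(k_txt: int, k_img: int,
                              alpha: float, gamma: float) -> np.ndarray:
    """Column-stochastic one-step transition matrix Q_t over a joint
    vocabulary laid out as [text tokens | image tokens | MASK].

    Hypothetical mask-and-replace scheme (not the paper's exact matrix):
      - a token keeps its value with probability alpha,
      - is resampled uniformly *within its own modality* otherwise,
      - or is absorbed into the shared MASK state with probability gamma.
    Cross-modal entries are zero, so one matrix covers both modalities.
    """
    assert 0.0 <= alpha and 0.0 <= gamma and alpha + gamma <= 1.0
    beta_txt = (1.0 - alpha - gamma) / k_txt  # uniform resample prob, text
    beta_img = (1.0 - alpha - gamma) / k_img  # uniform resample prob, image
    k = k_txt + k_img + 1                     # +1 for the shared MASK token
    q = np.zeros((k, k))
    # Text block: uniform replacement within the text vocabulary ...
    q[:k_txt, :k_txt] = beta_txt
    # ... plus extra mass on the diagonal for keeping the same token.
    q[np.arange(k_txt), np.arange(k_txt)] += alpha
    # Image block, same pattern, confined to the image vocabulary.
    img = np.arange(k_txt, k_txt + k_img)
    q[np.ix_(img, img)] = beta_img
    q[img, img] += alpha
    # Absorbing MASK row: both modalities can transition into MASK,
    # and MASK never leaves (its column is a unit vector).
    q[-1, :-1] = gamma
    q[-1, -1] = 1.0
    return q

if __name__ == "__main__":
    q = unified_transition_matrix(k_txt=4, k_img=6, alpha=0.9, gamma=0.05)
    # Every column sums to 1, so q @ x_onehot is a valid distribution.
    assert np.allclose(q.sum(axis=0), 1.0)
    print(q.shape)  # (11, 11)
```

In a full model one would need the cumulative products Q_t ⋯ Q_1 to sample a corrupted x_t directly from x_0; D3PM-style derivations give closed forms for these so the dense matrices above never have to be materialised for realistic vocabulary sizes.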