Multimodal Fusion Generative Adversarial Network for Image Synthesis
Published in: IEEE Signal Processing Letters, 2024, Vol. 31, pp. 1865-1869
Format: Article
Language: English
Abstract: Text-to-image synthesis has advanced significantly; however, a crucial limitation persists: textual descriptions often omit essential background details, leading to blurred backgrounds and diminished image quality. To address this, we propose a multimodal fusion framework that integrates information from both the text and image modalities. Our approach introduces a background mask to compensate for missing textual descriptions of background elements. Additionally, we employ an adaptive channel attention mechanism to effectively exploit the fused features, dynamically accentuating informative feature maps. Furthermore, we introduce a novel fusion conditional loss, ensuring that generated images not only align with their textual descriptions but also exhibit realistic backgrounds. Experimental evaluations on the Caltech-UCSD Birds 200 (CUB) and COCO datasets demonstrate the efficacy of our approach: our model achieves a Fréchet Inception Distance (FID) of 15.38 on CUB, surpassing several state-of-the-art approaches.
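The letter itself does not ship an implementation, so as a rough illustration only, the sketch below shows one plausible squeeze-and-excitation style realization of the "adaptive channel attention" idea: global pooling summarizes each channel of the fused text-image feature maps, and a small gating network re-weights the channels. All names, shapes, and the reduction ratio are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class AdaptiveChannelAttention(nn.Module):
    """Hypothetical channel attention over fused text-image features.

    Squeeze-and-excitation style sketch: the paper's actual design is
    not reproduced here, so the gating network and reduction ratio
    are assumptions.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global context per channel
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = fused.shape
        weights = self.gate(self.pool(fused).view(b, c)).view(b, c, 1, 1)
        return fused * weights  # accentuate informative feature maps

# Usage: re-weight fused feature maps before the generator's next stage.
fused = torch.randn(4, 256, 16, 16)  # (batch, channels, height, width)
attn = AdaptiveChannelAttention(channels=256)
out = attn(fused)  # same shape, channels dynamically re-weighted
```

In this reading, "adaptive" means the channel weights are computed from the fused features themselves rather than fixed, so feature maps carrying background cues (e.g. from the background mask branch) can be amplified or suppressed per image.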
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2024.3404855