Multimodal Fusion Generative Adversarial Network for Image Synthesis

Bibliographic Details
Published in: IEEE Signal Processing Letters, 2024, Vol. 31, pp. 1865-1869
Authors: Zhao, Liang; Hu, Qinghao; Li, Xiaoyuan; Zhao, Jingyuan
Format: Article
Language: English
Description
Abstract: Text-to-image synthesis has advanced significantly; however, a crucial limitation persists: textual descriptions often neglect essential background details, leading to blurred backgrounds and diminished image quality. To address this, we propose a multimodal fusion framework that integrates information from both text and image modalities. Our approach introduces a background mask to compensate for missing textual descriptions of background elements. Additionally, we employ an adaptive channel attention mechanism to effectively exploit fused features, dynamically accentuating informative feature maps. Furthermore, we introduce a novel fusion conditional loss, ensuring that generated images not only align with textual descriptions but also exhibit realistic backgrounds. Experimental evaluations on the Caltech-UCSD Birds 200 (CUB) and COCO datasets demonstrate the efficacy of our approach, with our Fréchet Inception Distance (FID) achieving a commendable score of 15.38 on the CUB dataset, surpassing several state-of-the-art approaches.
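The adaptive channel attention described in the abstract can be pictured as a squeeze-and-excitation style gate that re-weights the channels of the fused text-image feature map so that informative channels are accentuated. The PyTorch sketch below is an illustrative assumption only, not the authors' implementation: the module name AdaptiveChannelAttention, the reduction ratio, and the tensor shapes are hypothetical.

```python
# Illustrative sketch (assumed design): squeeze-and-excitation style channel
# attention applied to fused text-image feature maps. Names and shapes are
# hypothetical and not taken from the paper.
import torch
import torch.nn as nn


class AdaptiveChannelAttention(nn.Module):
    """Re-weights fused feature maps channel-wise (hypothetical sketch)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gates in [0, 1]
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = fused.shape
        gates = self.fc(self.pool(fused).view(b, c)).view(b, c, 1, 1)
        return fused * gates  # accentuate informative channels


if __name__ == "__main__":
    # Example: fused text-image features for a batch of 4 images.
    fused = torch.randn(4, 256, 16, 16)  # (batch, channels, H, W)
    attn = AdaptiveChannelAttention(channels=256)
    out = attn(fused)
    print(out.shape)  # torch.Size([4, 256, 16, 16])
```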
ISSN: 1070-9908; 1558-2361
DOI: 10.1109/LSP.2024.3404855