Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis

Bibliographic Details
Published in: Pattern Recognition, 2021-02, Vol. 110, p. 107384, Article 107384
Main authors: Gao, Lianli, Chen, Daiyuan, Zhao, Zhou, Shao, Jie, Shen, Heng Tao
Format: Article
Language: English
Online access: Full text
Description
Abstract:
•We propose a Conditional Manipulating Modular (CM-M) in the Conditional Manipulating Block (CM-B) to compensate for semantic information.
•We develop a Pyramid Attention Refine Block (PAR-B) to capture multi-scale context.
•The perceptual loss L1 and the image-consistency loss L2 are used to optimize the generator, improving the sharpness and consistency of generated images.

The text-to-image synthesis task aims to generate photographic images conditioned on semantic text descriptions. To ensure the sharpness and fidelity of generated images, this task tends to generate high-resolution images (e.g., 128² or 256²). However, as the resolution increases, the network parameters and complexity increase dramatically. Recent works introduce network structures with extensive parameters and heavy computation to guarantee the production of high-resolution images. As a result, these models suffer from unstable training and high training cost. To tackle these issues, we propose an effective information-compensation based approach, namely the Lightweight Dynamic Conditional GAN (LD-CGAN). LD-CGAN is a compact and structured single-stream network consisting of one generator and two independent discriminators, which regularize and generate 64² and 128² images in one feed-forward pass. Specifically, the generator of LD-CGAN is composed of three major components: (1) Conditional Embedding (CE), an unsupervised learning process that disentangles integrated semantic attributes in the text space; (2) the Conditional Manipulating Modular (CM-M) in the Conditional Manipulating Block (CM-B), which continuously provides the image features with compensation information (i.e., the disentangled attributes); and (3) the Pyramid Attention Refine Block (PAR-B), which enriches multi-scale features by capturing spatial importance across multi-scale context. Experiments on two benchmark datasets, CUB and Oxford-102, show that our generated 128² images achieve performance comparable to the 256² images generated by the state of the art on two evaluation metrics: Inception Score (IS) and Visual-Semantic Similarity (VS). Compared with the current state-of-the-art HDGAN, LD-CGAN decreases the number of parameters and the computation time by 86.8% and 94.9%, respectively.
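To make the pyramid-attention idea concrete, the following is a minimal PyTorch sketch of a multi-scale spatial-attention refinement block. It illustrates the general technique the abstract describes, not the authors' PAR-B implementation: the module name PyramidAttentionRefine, the pooling scales (1, 2, 4), and the sigmoid-gated fusion are all assumptions made here for the example.

```python
# A minimal sketch of a pyramid-attention refinement block (assumed design,
# not the paper's released PAR-B code): pool features at several grid sizes,
# learn a per-pixel attention map from the pooled context, and reweight the
# input features before a refining convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidAttentionRefine(nn.Module):
    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One 1x1 conv per pyramid level to summarize pooled context.
        self.context = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in scales
        )
        # Attention head: concatenated pyramid context -> spatial weights.
        self.attend = nn.Sequential(
            nn.Conv2d(channels * len(scales), channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        pyramid = []
        for scale, conv in zip(self.scales, self.context):
            # Pool to a coarser grid, summarize, and upsample back.
            pooled = F.adaptive_avg_pool2d(
                x, (max(h // scale, 1), max(w // scale, 1))
            )
            pyramid.append(
                F.interpolate(conv(pooled), size=(h, w), mode="nearest")
            )
        attention = self.attend(torch.cat(pyramid, dim=1))  # values in (0, 1)
        return self.refine(x * attention) + x  # residual refinement


if __name__ == "__main__":
    block = PyramidAttentionRefine(channels=64)
    features = torch.randn(2, 64, 32, 32)
    print(block(features).shape)  # torch.Size([2, 64, 32, 32])
```

Pooling at several grid sizes and letting a 1x1 convolution turn the concatenated context into per-pixel weights is one common way to make a spatial attention map reflect context at more than one scale, which matches the abstract's description of capturing spatial importance across multi-scale context.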
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2020.107384