Unsupervised text-to-image synthesis


Bibliographic details
Published in: Pattern Recognition, 2021-02, Vol. 110, p. 107573, Article 107573
Authors: Dong, Yanlong; Zhang, Ying; Ma, Lin; Wang, Zhi; Luo, Jiebo
Format: Article
Language: English
Abstract

Highlights:
• We make the first attempt to train a text-to-image synthesis model in an unsupervised manner.
• A novel visual concept discrimination loss is proposed to train both the generator and the discriminator: it not only encourages the generated image to express the local visual concepts but also ensures that the noisy visual concepts contained in the pseudo sentence are suppressed.
• A global semantic consistency loss is used to ensure that the generated image semantically corresponds to the input real sentence.
• Our model can generate pleasing images for a given sentence without relying on any image-text pair data, and even outperforms some text-to-image synthesis models trained in a supervised manner.

Recently, text-to-image synthesis has made great progress with the advancement of the Generative Adversarial Network (GAN). However, training GAN models requires a large amount of paired image-text data, which is extremely labor-intensive to collect. In this paper, we make the first attempt to train a text-to-image synthesis model in an unsupervised manner, without any human-labeled image-text pairs. Specifically, we first rely on visual concepts to bridge two independent image and sentence sets and thereby construct pseudo image-text pairs, on which a GAN model is initialized. A novel visual concept discrimination loss is proposed to train both the generator and the discriminator: it not only encourages the image to express the true local visual concepts but also ensures that the noisy visual concepts contained in the pseudo sentence are suppressed. Afterwards, a global semantic consistency loss with respect to the real sentence is used to adapt the pretrained GAN model to real sentences. Experimental results demonstrate that our unsupervised training strategy generates favorable images for given sentences and even outperforms some existing models trained in a supervised manner. The code of this paper is available at https://github.com/dylls/Unsupervised_Text-to-Image_Synthesis.
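To make the two objectives in the abstract concrete, the sketch below shows one plausible way a visual concept discrimination loss and a global semantic consistency loss could be written. The loss formulations, tensor shapes, and function names are assumptions made purely for illustration and are not taken from the paper; the authors' reference implementation is at the GitHub link above.

```python
# Illustrative sketch only (assumed formulations, not the authors' code).
import torch
import torch.nn.functional as F

def visual_concept_discrimination_loss(concept_logits, true_concepts, noisy_concepts):
    """Multi-label objective over a visual-concept vocabulary (assumed form).

    concept_logits: (B, C) discriminator scores for each concept on the image.
    true_concepts:  (B, C) binary mask of concepts detected in the source image.
    noisy_concepts: (B, C) binary mask of extra concepts that appear only in the
                    pseudo sentence and should be suppressed.
    """
    # Encourage the image to express the true local visual concepts ...
    pos = F.binary_cross_entropy_with_logits(
        concept_logits, true_concepts, reduction="none")
    # ... and push down concepts that come only from the noisy pseudo sentence.
    neg = F.binary_cross_entropy_with_logits(
        concept_logits, torch.zeros_like(concept_logits), reduction="none")
    return (pos * true_concepts + neg * noisy_concepts).sum(dim=1).mean()

def global_semantic_consistency_loss(image_emb, sentence_emb):
    """Pull the generated image's global embedding toward the real sentence's
    embedding (cosine-distance form assumed)."""
    return (1.0 - F.cosine_similarity(image_emb, sentence_emb, dim=1)).mean()

if __name__ == "__main__":
    # Toy shapes: batch of 4, 300-concept vocabulary, 256-d global embeddings.
    B, C, D = 4, 300, 256
    logits = torch.randn(B, C)
    true_mask = (torch.rand(B, C) > 0.95).float()
    noisy_mask = (torch.rand(B, C) > 0.97).float() * (1 - true_mask)
    img_emb, txt_emb = torch.randn(B, D), torch.randn(B, D)
    print(visual_concept_discrimination_loss(logits, true_mask, noisy_mask))
    print(global_semantic_consistency_loss(img_emb, txt_emb))
```

Under these assumptions, the concept loss would drive pretraining on the pseudo pairs, while the consistency term would be added when adapting the pretrained GAN to real sentences.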
ISSN: 0031-3203; 1873-5142
DOI: 10.1016/j.patcog.2020.107573