Text Conditioned Generative Adversarial Networks Generating Images and Videos: A Critical Review
Published in: | SN Computer Science 2024-10, Vol.5 (7), p.935, Article 935 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Generative adversarial networks (GANs) have attracted considerable attention in the deep learning community over the past few years. GANs are applied to numerous tasks, among which generating images or videos from text is regarded as the most impressive application at the intersection of computer vision and natural language processing. GANs that take text as input and produce images or videos are referred to as Text Conditioned GANs. Significant progress has been made in text-to-image generation; nonetheless, some areas still require attention. Also, while the challenge of generating videos from text has been addressed, there are fewer such studies than text-to-image studies. The primary goal of this work is to present state-of-the-art Text Conditioned GAN variants, including Text to Image GANs and Text to Video GANs. An in-depth comparative and critical analysis of the reviewed GANs is presented, pointing out their strengths and weaknesses as well as other findings for future advancements. We also tabulate the datasets and evaluation metrics used by these approaches, which offers insight into the most commonly used datasets and identifies those that need more attention in future research. Similarly, tabulating the evaluation metrics shows which metrics are regularly used and where new metrics are needed, for example for video generation. Herein, generation of images from text is studied in three broad categories: text to image GANs, text to face GANs, and transformer based text to image GANs. Transformer networks have sparked great interest in the computer vision community and have been adapted to a variety of vision and multi-modal learning tasks, so there is a need to study them for the text conditioned task as well, since only a few works apply transformers to text conditioned synthesis. We then survey video GANs, covering both conditional and unconditional GANs. Compared to Text to Image GANs, far fewer studies address text to video; a comparison of the two over time is also shown. We also provide a brief overview of recently introduced diffusion models. Finally, we discuss the critical findings from our survey and provide insight into areas for future research. |
---|---|
ISSN: | 2661-8907 2662-995X |
DOI: | 10.1007/s42979-024-03289-z |