TA2V: Text-Audio Guided Video Generation

Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2024, Vol. 26, pp. 7250-7264
Authors: Zhao, Minglu; Wang, Wenmin; Chen, Tongbao; Zhang, Rui; Li, Ruochen
Format: Article
Language: English

Abstract
Recent conditional and unconditional video generation tasks have been accomplished mainly with generative adversarial networks (GANs), diffusion models, and autoregressive models. However, in some circumstances a single modality cannot provide enough semantic information. Therefore, in this paper we propose text-audio to video (TA2V) generation, a new task for generating realistic videos guided by two different modalities, text and audio, which has not been explored much thus far. Compared to image generation, video generation is a harder task because of the complexity of processing higher-dimensional data and the scarcity of suitable datasets, especially for multimodal video generation. To overcome these limitations, (i) we propose the Text&Audio-guided-Video-Maker (TAgVM) model, which consists of two modules: a text-guided video generator and a text&audio-guided video modifier. (ii) The model uses a 3D VQ-GAN to compress high-dimensional video data into a low-dimensional discrete sequence, followed by an autoregressive model that guides text-conditional generation in the latent space. Then, we apply a text&audio-guided diffusion model to the generated video scenes, providing additional semantic details corresponding to the audio and text. (iii) We introduce a newly produced music performance video dataset, the University of Rochester Multimodal Music Performance with Video-Audio-Text (URMP-VAT) dataset, and a landscape dataset, Landscape with Video-Audio-Text (Landscape-VAT), both of which include three modalities (text, audio, and video) aligned with each other. The results demonstrate that our model can create videos with satisfactory quality and semantic information.
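
To make the two-stage pipeline described in the abstract concrete, below is a minimal, illustrative PyTorch sketch. It is not the authors' implementation: every class name, tensor shape, and hyperparameter here (VideoVQGAN3D, TextConditionedPrior, TextAudioRefiner, the 64-dimensional embeddings, the 2x8x8 token grid) is an assumption made for illustration, and the refiner stands in for the paper's text&audio-guided diffusion modifier with a single conditioned residual step rather than a full diffusion chain.

```python
# Illustrative sketch of a TAgVM-style two-stage pipeline. This is NOT the authors'
# code: every class, shape, and hyperparameter is an assumption, and the refiner
# replaces the paper's text&audio-guided diffusion modifier with a single
# conditioned residual step.
import torch
import torch.nn as nn

CODEBOOK_SIZE, EMBED_DIM, TEXT_DIM, AUDIO_DIM = 1024, 64, 64, 64  # assumed sizes


class VideoVQGAN3D(nn.Module):
    """Toy 3D VQ-GAN stand-in: maps video to a grid of discrete tokens and back."""

    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, EMBED_DIM)
        self.encoder = nn.Conv3d(3, EMBED_DIM, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose3d(EMBED_DIM, 3, kernel_size=4, stride=4)

    def encode(self, video):                           # video: (B, 3, T, H, W)
        z = self.encoder(video)                        # (B, C, t, h, w) latents
        flat = z.flatten(2).transpose(1, 2)            # (B, t*h*w, C)
        dists = (flat.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(-1), z.shape               # token ids, latent shape

    def decode(self, ids, latent_shape):
        z = self.codebook(ids).transpose(1, 2).reshape(latent_shape)
        return self.decoder(z)                         # reconstructed video


class TextConditionedPrior(nn.Module):
    """Toy autoregressive prior over VQ tokens, conditioned on a text embedding."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(CODEBOOK_SIZE + 1, EMBED_DIM)  # +1 for BOS
        self.rnn = nn.GRU(EMBED_DIM + TEXT_DIM, EMBED_DIM, batch_first=True)
        self.head = nn.Linear(EMBED_DIM, CODEBOOK_SIZE)

    @torch.no_grad()
    def sample(self, text_emb, num_tokens):            # text_emb: (B, TEXT_DIM)
        ids = torch.full((text_emb.size(0), 1), CODEBOOK_SIZE, dtype=torch.long)  # BOS
        hidden = None
        for _ in range(num_tokens):                    # greedy next-token decoding
            x = torch.cat([self.token_emb(ids[:, -1:]), text_emb[:, None, :]], dim=-1)
            out, hidden = self.rnn(x, hidden)
            next_id = self.head(out[:, -1]).argmax(-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
        return ids[:, 1:]                              # drop BOS


class TextAudioRefiner(nn.Module):
    """Toy stand-in for the text&audio-guided modifier (one conditioned residual step)."""

    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(TEXT_DIM + AUDIO_DIM, 3)
        self.residual = nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, video, text_emb, audio_emb):
        cond = self.cond_proj(torch.cat([text_emb, audio_emb], dim=-1))  # (B, 3)
        return video + self.residual(video + cond[:, :, None, None, None])


if __name__ == "__main__":
    vqgan, prior, refiner = VideoVQGAN3D(), TextConditionedPrior(), TextAudioRefiner()
    text_emb, audio_emb = torch.randn(1, TEXT_DIM), torch.randn(1, AUDIO_DIM)  # placeholders
    ids = prior.sample(text_emb, num_tokens=2 * 8 * 8)                # stage 1: text -> tokens
    coarse = vqgan.decode(ids, latent_shape=(1, EMBED_DIM, 2, 8, 8))  # tokens -> coarse video
    video = refiner(coarse, text_emb, audio_emb)                      # stage 2: text&audio refinement
    print(video.shape)                                                # torch.Size([1, 3, 8, 32, 32])
```

Running the script prints torch.Size([1, 3, 8, 32, 32]): stage 1 maps a text embedding to a coarse 8-frame, 32x32 clip through discrete latent tokens, and stage 2 adjusts that clip using the fused text and audio conditions, mirroring the generator-then-modifier split described above.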

ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3362149