xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We devised a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos over 14 seconds long and demonstrates competitive performance against state-of-the-art T2V models.
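The abstract describes the VidVAE only at a high level. As a rough illustration of why joint spatial and temporal compression shrinks the visual token count, here is a minimal PyTorch sketch; the module, channel widths, and 4x-temporal / 8x-spatial factors are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ToyVideoEncoder(nn.Module):
    """Hypothetical VidVAE-style encoder: strided 3D convolutions
    compress a clip temporally (x4) and spatially (x8), shrinking the
    number of latent positions the diffusion model must attend over.
    All factors and widths are illustrative, not taken from the paper."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, video):  # video: (B, C, T, H, W)
        return self.net(video)

clip = torch.randn(1, 3, 16, 256, 256)   # 16 frames at 256x256
latent = ToyVideoEncoder()(clip)
print(latent.shape)                       # -> torch.Size([1, 4, 4, 32, 32])
# Positions drop from 16*256*256 to 4*32*32, a 256x reduction
# before the DiT ever sees the video.
```

Under these assumed factors, a 16-frame clip yields 256x fewer latent positions than raw pixels, which is what makes long-sequence generation tractable for the downstream DiT.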
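The divide-and-merge strategy is named but not specified in the abstract. One plausible reading, sketched below with invented segment sizes and a linear cross-fade (a guess, not the authors' method), is to process a long latent sequence in overlapping temporal segments and blend the overlaps so segment boundaries stay temporally consistent.

```python
import torch

def divide_and_merge(latents, segment_len=16, overlap=4, process=lambda x: x):
    """Hypothetical divide-and-merge: run `process` (e.g. a denoising pass)
    on overlapping temporal segments, then cross-fade and renormalize the
    overlaps so the merged sequence has no seams. Sizes are illustrative."""
    T = latents.shape[2]                       # latents: (B, C, T, H, W)
    out = torch.zeros_like(latents)
    weight = torch.zeros(1, 1, T, 1, 1)
    step = segment_len - overlap
    ramp = torch.linspace(0, 1, overlap)       # linear fade-in weights
    for start in range(0, max(T - overlap, 1), step):
        end = min(start + segment_len, T)
        seg = process(latents[:, :, start:end])
        w = torch.ones(end - start)
        if start > 0:                          # fade in over the overlap
            w[:overlap] = ramp[: end - start]
        out[:, :, start:end] += seg * w.view(1, 1, -1, 1, 1)
        weight[:, :, start:end] += w.view(1, 1, -1, 1, 1)
    return out / weight.clamp(min=1e-6)        # normalize blended frames

x = torch.randn(1, 4, 40, 32, 32)
merged = divide_and_merge(x)
print(torch.allclose(merged, x, atol=1e-5))    # identity `process` -> True
```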
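The DiT is said to incorporate spatial and temporal self-attention layers. A common factorized formulation, shown below with assumed shapes and without the timestep/text conditioning a real DiT block carries, attends over patches within each frame and then over frames at each patch position.

```python
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Illustrative factorized spatiotemporal transformer block:
    spatial self-attention within each frame, then temporal
    self-attention across frames at each spatial location.
    Dimensions and layout are assumptions, not the paper's block."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, T, N, D) latent tokens
        B, T, N, D = x.shape
        s = self.norm1(x).reshape(B * T, N, D)           # attend over patches
        x = x + self.spatial(s, s, s)[0].reshape(B, T, N, D)
        t = self.norm2(x).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = self.temporal(t, t, t)[0]                    # attend over frames
        x = x + t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x

x = torch.randn(2, 4, 64, 256)            # 4 latent frames, 64 patches each
print(FactorizedSTBlock()(x).shape)       # -> torch.Size([2, 4, 64, 256])
```

Factorizing attention this way keeps the cost at roughly O(N^2 + T^2) per position instead of O((NT)^2) for full spatiotemporal attention, which is one standard reason such blocks generalize across clip lengths and aspect ratios.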
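The data pipeline's filtering stages (text detection, motion estimation, aesthetics scoring) reduce raw footage to high-quality training clips. The sketch below shows only the gating logic; the thresholds and score fields are invented, and the actual scoring models (optical flow, an aesthetics predictor, the in-house video-LLM captioner) are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    motion_score: float      # e.g. mean optical-flow magnitude (assumed)
    aesthetic_score: float   # e.g. from an aesthetics predictor (assumed)
    text_area: float         # fraction of frame covered by detected text

def keep(clip: Clip) -> bool:
    # Cutoffs are invented for illustration; the paper lists the
    # pipeline stages but not its exact thresholds.
    return (clip.motion_score > 0.5         # drop near-static clips
            and clip.aesthetic_score > 4.0  # drop low-quality footage
            and clip.text_area < 0.05)      # drop text-heavy clips

clips = [
    Clip("a.mp4", motion_score=2.1, aesthetic_score=5.2, text_area=0.01),
    Clip("b.mp4", motion_score=0.1, aesthetic_score=6.0, text_area=0.00),
]
print([c.path for c in clips if keep(c)])   # -> ['a.mp4']
```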

Bibliographic Details
Main Authors: Qin, Can; Xia, Congying; Ramakrishnan, Krithika; Ryoo, Michael; Tu, Lifu; Feng, Yihao; Shu, Manli; Zhou, Honglu; Awadalla, Anas; Wang, Jun; Purushwalkam, Senthil; Xue, Le; Zhou, Yingbo; Wang, Huan; Savarese, Silvio; Niebles, Juan Carlos; Chen, Zeyuan; Xu, Ran; Xiong, Caiming
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2408.12590
DOI: 10.48550/arxiv.2408.12590
Published: 2024-08-22
Source: arXiv.org