xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE)…
Saved in:
Main authors: | Qin, Can; Xia, Congying; Ramakrishnan, Krithika; Ryoo, Michael; Tu, Lifu; Feng, Yihao; Shu, Manli; Zhou, Honglu; Awadalla, Anas; Wang, Jun; Purushwalkam, Senthil; Xue, Le; Zhou, Yingbo; Wang, Huan; Savarese, Silvio; Niebles, Juan Carlos; Chen, Zeyuan; Xu, Ran; Xiong, Caiming |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition |
Online access: | Order full text |
creator | Qin, Can; Xia, Congying; Ramakrishnan, Krithika; Ryoo, Michael; Tu, Lifu; Feng, Yihao; Shu, Manli; Zhou, Honglu; Awadalla, Anas; Wang, Jun; Purushwalkam, Senthil; Xue, Le; Zhou, Yingbo; Wang, Huan; Savarese, Silvio; Niebles, Juan Carlos; Chen, Zeyuan; Xu, Ran; Xiong, Caiming |
description | We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of
producing realistic scenes from textual descriptions. Building on recent
advancements, such as OpenAI's Sora, we explore the latent diffusion model
(LDM) architecture and introduce a video variational autoencoder (VidVAE).
VidVAE compresses video data both spatially and temporally, significantly
reducing the length of visual tokens and the computational demands associated
with generating long-sequence videos. To further address the computational
costs, we propose a divide-and-merge strategy that maintains temporal
consistency across video segments. Our Diffusion Transformer (DiT) model
incorporates spatial and temporal self-attention layers, enabling robust
generalization across different timeframes and aspect ratios. We built a
data processing pipeline from scratch and collected over 13M high-quality
video-text pairs. The pipeline includes steps such as clipping, text
detection, motion estimation, aesthetics scoring, and dense captioning with
our in-house video-LLM. Training the VidVAE and DiT models required
approximately 40 and 642 H100 days, respectively. Our model supports
end-to-end generation of 720p videos longer than 14 seconds and demonstrates
competitive performance against state-of-the-art T2V models. |
doi_str_mv | 10.48550/arxiv.2408.12590 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2408.12590 |
language | eng |
recordid | cdi_arxiv_primary_2408_12590 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition |
title | xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations |
url | https://arxiv.org/abs/2408.12590 |
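
The sketches below illustrate, in Python, three mechanisms named in the abstract. First, the spatio-temporal compression and the divide-and-merge split: the compression factors (4x temporal, 8x spatial downsampling, 2x2 spatial patchify) and the one-latent-frame segment overlap are assumptions for illustration; the paper's actual VidVAE and segmentation settings are not given in this record.

```python
# Illustrative arithmetic for the VidVAE compression and divide-and-merge
# split described in the abstract. All factors (4x temporal / 8x spatial
# downsampling, 2x2 spatial patchify, 1-frame segment overlap) are assumed
# for illustration, not taken from the paper.

def latent_tokens(frames, height, width,
                  t_down=4, s_down=8, patch_t=1, patch_hw=2):
    """Number of DiT tokens for one clip after VAE compression and patchify."""
    lt = frames // t_down        # latent frames
    lh = height // s_down        # latent height
    lw = width // s_down         # latent width
    return (lt // patch_t) * (lh // patch_hw) * (lw // patch_hw)

def divide_and_merge(latent_frames, segment_len=8, overlap=1):
    """Split a long latent video into overlapping segments; the shared
    overlap frames are what let consecutive segments be blended so the
    decoded video stays temporally consistent."""
    segments, start = [], 0
    while start < latent_frames:
        end = min(start + segment_len, latent_frames)
        segments.append((start, end))
        if end == latent_frames:
            break
        start = end - overlap    # next segment re-encodes `overlap` frames
    return segments

if __name__ == "__main__":
    # Roughly 14 s of 720p video at 24 fps: 336 frames of 720x1280.
    print(latent_tokens(frames=336, height=720, width=1280))  # 302400 tokens
    print(divide_and_merge(latent_frames=336 // 4))           # overlapping (start, end) spans
```

With these assumed factors, the same 14-second clip without any VAE downsampling would produce roughly 256 times as many tokens, which is the computational pressure the abstract's compressed representation is meant to relieve.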
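Second, a block that factorizes self-attention into a spatial pass (within each frame) and a temporal pass (across frames), as the abstract describes for the DiT. The pre-norm layout, the use of nn.MultiheadAttention, and the chosen sizes are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Minimal sketch of a transformer block with separate spatial and
    temporal self-attention. Layout and sizes are illustrative assumptions,
    not the paper's DiT implementation."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, frames, tokens_per_frame, dim)
        b, f, s, d = x.shape

        # Spatial self-attention: tokens attend only within their own frame.
        h = x.reshape(b * f, s, d)
        n = self.norm_s(h)
        h = h + self.attn_s(n, n, n)[0]

        # Temporal self-attention: each spatial location attends across frames.
        h = h.reshape(b, f, s, d).permute(0, 2, 1, 3).reshape(b * s, f, d)
        n = self.norm_t(h)
        h = h + self.attn_t(n, n, n)[0]

        # Position-wise feed-forward.
        h = h + self.mlp(self.norm_m(h))
        return h.reshape(b, s, f, d).permute(0, 2, 1, 3)

# Example: a batch of 2 clips, 4 latent frames, 16x16 = 256 tokens per frame.
x = torch.randn(2, 4, 256, 512)
print(SpatioTemporalBlock()(x).shape)  # torch.Size([2, 4, 256, 512])
```

Factorizing attention this way keeps the per-layer cost around O(f·s²) + O(s·f²) rather than O((f·s)²) for full spatio-temporal attention, which is what makes long sequences of latent frames tractable and lets the block handle varying frame counts and aspect ratios.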
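Third, the general shape of the data-curation pipeline the abstract lists (clipping, text detection, motion estimation, aesthetics scoring, dense captioning). The Clip fields, stage names, and thresholds below are placeholders; the record does not specify the actual tools or cutoffs used.

```python
# Sketch of a multi-step filter-and-caption pipeline over candidate clips.
# Field names and thresholds are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Clip:
    path: str
    ocr_text_ratio: float = 0.0   # fraction of frames dominated by on-screen text
    motion_score: float = 0.0     # e.g. mean optical-flow magnitude
    aesthetic_score: float = 0.0  # e.g. output of an aesthetics predictor
    caption: Optional[str] = None

def keep(clip: Clip,
         max_text_ratio: float = 0.3,
         min_motion: float = 0.5,
         min_aesthetic: float = 4.5) -> bool:
    """Drop clips that are mostly text overlays, nearly static, or low quality."""
    return (clip.ocr_text_ratio <= max_text_ratio
            and clip.motion_score >= min_motion
            and clip.aesthetic_score >= min_aesthetic)

def build_pairs(clips: list[Clip],
                captioner: Callable[[Clip], str]) -> list[tuple[str, str]]:
    """Return (video_path, dense_caption) pairs for clips that pass every filter."""
    pairs = []
    for clip in clips:
        if keep(clip):
            clip.caption = captioner(clip)   # the paper uses an in-house video-LLM here
            pairs.append((clip.path, clip.caption))
    return pairs
```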