Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges, including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X that delivers stronger generation performance with greater training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce the number of sampling steps when solving the Flow ODE, and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all code and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.
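The sigmoid time discretization mentioned in the abstract can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' released code: uniform steps are pushed through a rescaled sigmoid, so the resulting timesteps cluster near the two ends of the trajectory (where the sigmoid is flat) and are spaced more widely in the middle. The `scale` parameter is an assumed knob controlling how strong that clustering is.

```python
import numpy as np

def sigmoid_time_schedule(num_steps: int, scale: float = 6.0) -> np.ndarray:
    """Map uniform steps through a rescaled sigmoid (illustrative sketch).

    `scale` is a hypothetical parameter: larger values cluster timesteps
    more tightly near t=0 and t=1. The output is renormalized so the
    schedule still runs exactly from 0 to 1.
    """
    u = np.linspace(-scale, scale, num_steps)
    s = 1.0 / (1.0 + np.exp(-u))       # raw sigmoid values in (0, 1)
    s = (s - s[0]) / (s[-1] - s[0])    # renormalize to exactly [0, 1]
    return s

ts = sigmoid_time_schedule(10)
```

An ODE sampler would then evaluate the network at these non-uniform `ts` instead of at evenly spaced times, spending its step budget where it matters most.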

Bibliographic Details
Main Authors: Zhuo, Le; Du, Ruoyi; Xiao, Han; Li, Yangguang; Liu, Dongyang; Huang, Rongjie; Liu, Wenze; Zhao, Lirui; Wang, Fu-Yun; Ma, Zhanyu; Luo, Xu; Wang, Zehan; Zhang, Kaipeng; Zhu, Xiangyang; Liu, Si; Yue, Xiangyu; Liu, Dingning; Ouyang, Wanli; Liu, Ziwei; Qiao, Yu; Li, Hongsheng; Gao, Peng
Format: Article
Language: eng
Online Access: Order full text
DOI: 10.48550/arxiv.2406.18583
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning