GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually incur high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. These components are then fed into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can efficiently generate high-quality videos while maintaining motion coherency and entity consistency. GPT4Motion offers new insights into text-to-video research, enhancing its quality and broadening its horizon for further explorations.
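
The pipeline described above has two stages. In the first, GPT-4 turns the user prompt into a Blender script that drives the built-in physics engine. The snippet below is a minimal sketch, assuming the "rigid object drop and collision" scenario, of the kind of Blender Python (bpy) script such a stage could produce; the object sizes, frame range, camera and light placement, and output path are illustrative assumptions, not the paper's actual generated output.

```python
import bpy

# Ground plane acting as a passive collider.
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'PASSIVE'

# Sphere dropped from above as an active rigid body; gravity does the rest.
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(0, 0, 5))
bpy.ops.rigidbody.object_add()
bpy.context.object.rigid_body.type = 'ACTIVE'

# Camera and light so the clip can be rendered without relying on a
# pre-populated scene (placement values are arbitrary).
bpy.ops.object.camera_add(location=(8, -8, 6), rotation=(1.1, 0, 0.785))
bpy.context.scene.camera = bpy.context.object
bpy.ops.object.light_add(type='SUN', location=(0, 0, 10))

# Simulate a short clip; the physics engine animates the motion per frame.
scene = bpy.context.scene
scene.frame_start = 1
scene.frame_end = 60

# Render the per-frame scene components to disk (path is illustrative).
scene.render.filepath = '/tmp/gpt4motion_frames/frame_'
bpy.ops.render.render(animation=True)
```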

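In the second stage, the rendered scene components condition Stable Diffusion. The abstract does not specify the conditioning mechanism, so the sketch below assumes a ControlNet-style setup via Hugging Face diffusers, conditioning each frame on an edge map derived from the simulation render; the model identifiers, the edge-map conditioning, and the fixed seed for cross-frame consistency are assumptions for illustration, not details confirmed by the paper.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Hypothetical per-frame conditioning: a Blender render converted to a Canny
# edge map (assumed precomputed) steers generation toward the simulated motion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a basketball falling onto a wooden floor, photorealistic"
edge_map = Image.open("/tmp/gpt4motion_frames/frame_0001_canny.png")

# Reusing one seed across frames is one simple way to encourage entity
# consistency; the paper's actual cross-frame strategy may differ.
generator = torch.Generator("cuda").manual_seed(0)
frame = pipe(prompt, image=edge_map, generator=generator).images[0]
frame.save("frame_0001.png")
```
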
Bibliographic details
Published in: arXiv.org, 2024-04
Main authors: Lv, Jiaxi; Huang, Yi; Yan, Mingfu; Huang, Jiancheng; Liu, Jianzhuang; Liu, Yifan; Wen, Yafei; Chen, Xiaoxin; Chen, Shifeng
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Coherence; Image enhancement; Image processing; Image quality; Large language models; Liquid flow; Physical simulation; Video
Online access: Full text