GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning
Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. These components are then fed into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate high-quality videos efficiently while maintaining motion coherence and entity consistency. GPT4Motion offers new insights into text-to-video research, enhancing its quality and broadening its horizon for further explorations.
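The abstract describes GPT-4 emitting a Blender script that drives the built-in physics engine before Stable Diffusion renders the final video. As a rough, hypothetical sketch (not the paper's actual output), a script of the kind GPT-4 might produce for the "rigid object drop and collision" scenario could look like the following; it uses only Blender's standard Python API (bpy), and all object names, physics parameters, and the output path are illustrative assumptions.

```python
# Hypothetical sketch of a GPT-4-generated Blender script for a
# "rigid object drop and collision" scene; parameters are illustrative.
import bpy

# Start from an empty scene.
bpy.ops.wm.read_factory_settings(use_empty=True)

# Ground plane that receives the collision (a passive rigid body).
bpy.ops.mesh.primitive_plane_add(size=10.0, location=(0.0, 0.0, 0.0))
ground = bpy.context.active_object
bpy.ops.rigidbody.object_add()
ground.rigid_body.type = 'PASSIVE'

# A ball dropped from above (an active rigid body driven by gravity).
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(0.0, 0.0, 5.0))
ball = bpy.context.active_object
bpy.ops.rigidbody.object_add()
ball.rigid_body.type = 'ACTIVE'
ball.rigid_body.mass = 1.0
ball.rigid_body.restitution = 0.8  # bounciness on impact

# Simulate and render a short clip; the physics engine keys the motion,
# so every frame carries physically coherent positions for the ball.
scene = bpy.context.scene
scene.frame_start, scene.frame_end = 1, 48
scene.render.filepath = "/tmp/drop_"  # illustrative output path
bpy.ops.render.render(animation=True)
```

The sketch covers only the simulation stage; how the rendered frames are then used to condition Stable Diffusion is not detailed in the abstract, so that step is not sketched here.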
Saved in:
Published in: | arXiv.org 2024-04 |
---|---|
Main Authors: | Lv, Jiaxi; Huang, Yi; Yan, Mingfu; Huang, Jiancheng; Liu, Jianzhuang; Liu, Yifan; Wen, Yafei; Chen, Xiaoxin; Chen, Shifeng |
Format: | Article |
Language: | eng |
Subjects: | Coherence; Image enhancement; Image processing; Image quality; Large language models; Liquid flow; Physical simulation; Video |
Online Access: | Full Text |
container_title | arXiv.org |
---|---|
creator | Lv, Jiaxi; Huang, Yi; Yan, Mingfu; Huang, Jiancheng; Liu, Jianzhuang; Liu, Yifan; Wen, Yafei; Chen, Xiaoxin; Chen, Shifeng |
description | Recent advances in text-to-video generation have harnessed the power of diffusion models to create visually compelling content conditioned on text prompts. However, they usually encounter high computational costs and often struggle to produce videos with coherent physical motions. To tackle these issues, we propose GPT4Motion, a training-free framework that leverages the planning capability of large language models such as GPT, the physical simulation strength of Blender, and the excellent image generation ability of text-to-image diffusion models to enhance the quality of video synthesis. Specifically, GPT4Motion employs GPT-4 to generate a Blender script based on a user textual prompt, which commands Blender's built-in physics engine to craft fundamental scene components that encapsulate coherent physical motions across frames. These components are then fed into Stable Diffusion to generate a video aligned with the textual prompt. Experimental results on three basic physical motion scenarios, including rigid object drop and collision, cloth draping and swinging, and liquid flow, demonstrate that GPT4Motion can generate high-quality videos efficiently while maintaining motion coherence and entity consistency. GPT4Motion offers new insights into text-to-video research, enhancing its quality and broadening its horizon for further explorations. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-04 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2892395789 |
source | Freely Accessible Journals |
subjects | Coherence; Image enhancement; Image processing; Image quality; Large language models; Liquid flow; Physical simulation; Video |
title | GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T10%3A12%3A53IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=GPT4Motion:%20Scripting%20Physical%20Motions%20in%20Text-to-Video%20Generation%20via%20Blender-Oriented%20GPT%20Planning&rft.jtitle=arXiv.org&rft.au=Lv,%20Jiaxi&rft.date=2024-04-23&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2892395789%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2892395789&rft_id=info:pmid/&rfr_iscdi=true |