Generative Timelines for Instructed Visual Assembly

The objective of this work is to manipulate visual timelines (e.g. a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert or potentially even disabled users. We call this task Instructed visual assembly. This task is challenging as it requires (i) identifying relevant visual content in the input timeline as well as retrieving relevant visual content in a given input (video) collection, (ii) understanding the input natural language instruction, and (iii) performing the desired edits of the input visual timeline to produce an output timeline. To address these challenges, we propose the Timeline Assembler, a generative model trained to perform instructed visual assembly tasks. The contributions of this work are three-fold. First, we develop a large multimodal language model, which is designed to process visual content, compactly represent timelines, and accurately interpret timeline editing instructions. Second, we introduce a novel method for automatically generating datasets for visual assembly tasks, enabling efficient training of our model without the need for human-labeled data. Third, we validate our approach by creating two novel datasets for image and video assembly, demonstrating that the Timeline Assembler substantially outperforms established baseline models, including the recent GPT-4o, in accurately executing complex assembly instructions across various real-world inspired scenarios.
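
This record carries only the abstract, so the paper's actual data formats and training procedure are not available here. Purely as a hedged illustration of the two ideas the abstract describes (a timeline as an editable sequence of clips, and label-free dataset synthesis), below is a minimal sketch in Python; every name in it (Timeline, Edit, apply_edit, make_training_example) is hypothetical and not taken from the paper.

import random
from dataclasses import dataclass

# Hypothetical representation: a timeline is an ordered list of clip IDs
# drawn from a larger media collection.
Timeline = list[str]

@dataclass
class Edit:
    op: str                  # one of "insert", "remove", "replace"
    position: int            # index in the timeline the edit applies to
    clip: str | None = None  # clip ID from the collection, when needed

def apply_edit(timeline: Timeline, edit: Edit) -> Timeline:
    # Apply one primitive assembly edit and return the new timeline.
    out = list(timeline)
    if edit.op == "insert":
        out.insert(edit.position, edit.clip)
    elif edit.op == "remove":
        out.pop(edit.position)
    elif edit.op == "replace":
        out[edit.position] = edit.clip
    return out

def make_training_example(target: Timeline):
    # One plausible label-free synthesis recipe: corrupt a known-good
    # timeline with a random edit, then pair the corrupted timeline with
    # the inverse edit phrased as a natural language instruction.
    pos = random.randrange(len(target))
    corrupted = [c for i, c in enumerate(target) if i != pos]
    instruction = f"insert clip {target[pos]} at position {pos}"
    fix = Edit(op="insert", position=pos, clip=target[pos])
    assert apply_edit(corrupted, fix) == target  # inverse edit restores target
    return corrupted, instruction, target

Note that the actual Timeline Assembler is a multimodal language model that generates the edited timeline directly from the visual content and instruction rather than executing symbolic operations like these, and the paper's dataset-generation method may differ from this corrupt-and-invert sketch.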

Bibliographic Details
Main Authors: Pardo, Alejandro; Wang, Jui-Hsien; Ghanem, Bernard; Sivic, Josef; Russell, Bryan; Heilbron, Fabian Caba
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Human-Computer Interaction; Computer Science - Multimedia
Online Access: https://arxiv.org/abs/2411.12293
DOI: 10.48550/arxiv.2411.12293
Published: 2024-11-19
License: CC BY-NC-SA 4.0 (http://creativecommons.org/licenses/by-nc-sa/4.0)
Source: arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
Computer Science - Human-Computer Interaction
Computer Science - Multimedia
title Generative Timelines for Instructed Visual Assembly
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-21T20%3A26%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Generative%20Timelines%20for%20Instructed%20Visual%20Assembly&rft.au=Pardo,%20Alejandro&rft.date=2024-11-19&rft_id=info:doi/10.48550/arxiv.2411.12293&rft_dat=%3Carxiv_GOX%3E2411_12293%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true