Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era

The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools require significant expertise, limiting accessibility. Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. By synthesizing over 100 publications, we explore methods from generative adversarial networks to diffusion models, examining multimodal integration for fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting increased accessibility and alignment with human intuition. Our survey compares existing literature, emphasizing LLM-empowered editing, and identifies key challenges to stimulate further research. We aim to democratize powerful visual editing across various industries, from entertainment to education. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-instruction-editing.
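To make the idea of natural-language ("instruction-based") editing concrete, the sketch below shows one typical workflow using the Hugging Face diffusers library with the public InstructPix2Pix checkpoint. This is an illustrative example only, not a method introduced by the survey itself; the checkpoint name, prompt, file paths, and parameter values are assumptions for demonstration.

# Minimal sketch of instruction-guided image editing, assuming the Hugging Face
# `diffusers` library is installed and the public InstructPix2Pix weights are available.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Load a diffusion pipeline fine-tuned to follow textual editing instructions.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",   # public checkpoint (assumed accessible via the hub)
    torch_dtype=torch.float16,
).to("cuda")

# "input.png" is a placeholder path for the image the user wants to edit.
image = load_image("input.png").convert("RGB")

# The natural-language instruction is the only control signal the user supplies.
edited = pipe(
    prompt="replace the cloudy sky with a clear sunset",
    image=image,
    num_inference_steps=20,     # fewer steps run faster at some cost in fidelity
    image_guidance_scale=1.5,   # how strongly to preserve the original image content
    guidance_scale=7.5,         # how strongly to follow the text instruction
).images[0]

edited.save("edited.png")

In this setup the user never touches masks, layers, or selection tools; the trade-off between faithfulness to the input image and adherence to the instruction is controlled by the two guidance scales.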

Bibliographic details
Main authors: Nguyen, Thanh Tam; Ren, Zhao; Pham, Trinh; Huynh, Thanh Trung; Nguyen, Phi Le; Yin, Hongzhi; Nguyen, Quoc Viet Hung
Format: Article
Language: English
Published: 2024-11-15 (arXiv)
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Human-Computer Interaction; Computer Science - Learning; Computer Science - Multimedia
Rights: CC BY-NC-ND 4.0
Online access: Full text at https://arxiv.org/abs/2411.09955
creator Nguyen, Thanh Tam
Ren, Zhao
Pham, Trinh
Huynh, Thanh Trung
Nguyen, Phi Le
Yin, Hongzhi
Nguyen, Quoc Viet Hung
description The rapid advancement of large language models (LLMs) and multimodal learning has transformed digital content creation and manipulation. Traditional visual editing tools require significant expertise, limiting accessibility. Recent strides in instruction-based editing have enabled intuitive interaction with visual content, using natural language as a bridge between user intent and complex editing operations. This survey provides an overview of these techniques, focusing on how LLMs and multimodal models empower users to achieve precise visual modifications without deep technical knowledge. By synthesizing over 100 publications, we explore methods from generative adversarial networks to diffusion models, examining multimodal integration for fine-grained content control. We discuss practical applications across domains such as fashion, 3D scene manipulation, and video synthesis, highlighting increased accessibility and alignment with human intuition. Our survey compares existing literature, emphasizing LLM-empowered editing, and identifies key challenges to stimulate further research. We aim to democratize powerful visual editing across various industries, from entertainment to education. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-instruction-editing.
doi_str_mv 10.48550/arxiv.2411.09955
format Article
identifier DOI: 10.48550/arxiv.2411.09955
language eng
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Human-Computer Interaction
Computer Science - Learning
Computer Science - Multimedia
title Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era