GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding code. However, current agents primarily exhibit excellent understanding capabilities in static environments and are predominantly applied in relatively simple domains, such as Web or mobile interfaces. We argue that a robust GUI agent should be capable of perceiving temporal information on the GUI, including dynamic Web content and multi-step tasks. Additionally, it should possess a comprehensive understanding of various GUI scenarios, including desktop software and multi-window interactions. To this end, this paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations, extensively covering six GUI scenarios and eight types of GUI-oriented questions in three formats. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content, especially dynamic and sequential content. Our findings reveal that ImageLLMs struggle with dynamic GUI content without manually annotated keyframes or operation history. On the other hand, VideoLLMs fall short in all GUI-oriented tasks given the sparse GUI video dataset. Based on GUI-World, we take the initial step of leveraging a fine-tuned VideoLLM as a GUI agent, demonstrating an improved understanding of various GUI tasks. However, due to the limitations in the performance of base LLMs, we conclude that using VideoLLMs as GUI agents remains a significant challenge. We believe our work provides valuable insights for future research in dynamic GUI content understanding. The code and dataset are publicly available at our project homepage: https://gui-world.github.io/.
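To give a concrete sense of the keyframe-based evaluation the abstract describes, below is a minimal sketch of querying an image-capable MLLM with a few GUI keyframes and a free-form question. It is illustrative only: the model name, keyframe paths, and question text are assumptions, not GUI-World's actual data format or evaluation harness (those are documented on the project homepage).

```python
# Minimal sketch: ask an image-capable MLLM a GUI-oriented question over keyframes.
# Assumptions (not part of GUI-World itself): model name, file paths, question text.
import base64
from pathlib import Path

from openai import OpenAI  # pip install openai


def encode_image(path: Path) -> str:
    """Base64-encode a keyframe so it can be sent inline to the API."""
    return base64.b64encode(path.read_bytes()).decode("utf-8")


def ask_over_keyframes(keyframe_paths: list[Path], question: str, model: str = "gpt-4o") -> str:
    """Send the question plus all keyframes in one user message and return the answer text."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{"type": "text", "text": question}]
    for path in keyframe_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Hypothetical keyframes extracted from a screen recording of a multi-step GUI task.
    frames = [Path("keyframes/step_01.png"), Path("keyframes/step_02.png"), Path("keyframes/step_03.png")]
    print(ask_over_keyframes(frames, "Describe the sequence of user actions shown in these GUI keyframes."))
```

A video-native MLLM would instead consume the full recording; the abstract's finding is that, without such manually annotated keyframes or an operation history, image-only models struggle on dynamic GUI content.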


Bibliographic Details
Main Authors: Chen, Dongping; Huang, Yue; Wu, Siyuan; Tang, Jingyu; Chen, Liuyi; Bai, Yilin; He, Zhigang; Wang, Chenlong; Zhou, Huichi; Li, Yiqiang; Zhou, Tianshuo; Yu, Yue; Gao, Chujie; Zhang, Qihui; Gui, Yi; Li, Zhen; Wan, Yao; Zhou, Pan; Gao, Jianfeng; Sun, Lichao
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arxiv.2406.10819
Published: 2024-06-16
Source: arXiv.org