Visual grounding for desktop graphical user interfaces

Most instance perception and image understanding solutions focus mainly on natural images. However, applications for synthetic images, and more specifically, images of Graphical User Interfaces (GUI) remain limited. This hinders the development of autonomous computer-vision-powered Artificial Intell...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	arXiv.org 2024-09
Hauptverfasser:	Dardouri, Tassnim, Minkova, Laura, Jessica López Espejel, Dahhane, Walid, El Hassane Ettifouri
Format:	Artikel
Sprache:	eng
Schlagworte:	Agents (artificial intelligence) Artificial intelligence Computer vision Graphical user interface Large language models Object recognition
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Most instance perception and image understanding solutions focus mainly on natural images. However, applications for synthetic images, and more specifically, images of Graphical User Interfaces (GUI) remain limited. This hinders the development of autonomous computer-vision-powered Artificial Intelligence (AI) agents. In this work, we present Instruction Visual Grounding or IVG, a multi-modal solution for object identification in a GUI. More precisely, given a natural language instruction and GUI screen, IVG locates the coordinates of the element on the screen where the instruction would be executed. To this end, we develop two methods. The first method is a three-part architecture that relies on a combination of a Large Language Model (LLM) and an object detection model. The second approach uses a multi-modal foundation model.
ISSN:	2331-8422