Visual grounding for desktop graphical user interfaces
Format: Article
Language: English
Abstract: Most instance perception and image understanding solutions focus mainly on natural images. However, applications for synthetic images, and more specifically, images of Graphical User Interfaces (GUIs), remain limited. This hinders the development of autonomous, computer-vision-powered Artificial Intelligence (AI) agents. In this work, we present Instruction Visual Grounding, or IVG, a multi-modal solution for object identification in a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the element on the screen where the instruction would be executed. To this end, we develop two methods. The first is a three-part architecture that relies on a combination of a Large Language Model (LLM) and an object detection model. The second approach uses a multi-modal foundation model.
DOI: 10.48550/arxiv.2407.01558
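The abstract only describes the two approaches at a high level. The following is a minimal Python sketch of the input/output contract they share (a natural-language instruction plus a GUI screenshot in, screen coordinates out) and of how a detector-plus-LLM pipeline and a single multi-modal grounder could be wired. All class, function, and parameter names here (`UIElement`, `ObjectDetector`, `ground_with_llm_and_detector`, etc.) are illustrative assumptions, not code or APIs from the paper.

```python
# Hypothetical sketch of the grounding interface described in the abstract.
# None of these names come from the paper; they only illustrate the idea of
# mapping (instruction, screenshot) -> (x, y) click coordinates.

from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class UIElement:
    """A candidate GUI element: a bounding box plus a text description."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    description: str


class ObjectDetector(Protocol):
    """Detects candidate GUI elements in a screenshot."""
    def detect(self, screenshot) -> List[UIElement]: ...


class LanguageModel(Protocol):
    """Picks the index of the candidate that best matches the instruction."""
    def pick(self, instruction: str, candidates: List[str]) -> int: ...


class MultimodalGrounder(Protocol):
    """A foundation model that grounds the instruction in one step."""
    def ground(self, instruction: str, screenshot) -> Tuple[int, int]: ...


def ground_with_llm_and_detector(
    instruction: str,
    screenshot,
    detector: ObjectDetector,
    llm: LanguageModel,
) -> Tuple[int, int]:
    """First approach (sketched): detect candidate elements, let an LLM
    choose the one matching the instruction, return its box centre."""
    elements = detector.detect(screenshot)
    idx = llm.pick(instruction, [e.description for e in elements])
    x_min, y_min, x_max, y_max = elements[idx].box
    return (x_min + x_max) // 2, (y_min + y_max) // 2


def ground_with_foundation_model(
    instruction: str,
    screenshot,
    model: MultimodalGrounder,
) -> Tuple[int, int]:
    """Second approach (sketched): a multi-modal model predicts the
    coordinates directly from the instruction and the screenshot."""
    return model.ground(instruction, screenshot)
```

Either function returns a single screen coordinate, which is the form of output an autonomous GUI agent would need to execute the instruction (for example, to issue a click at that location).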