Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.
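The round-trip evaluation described in the abstract is simple to illustrate. The Python sketch below is a minimal illustration of the idea, not the authors' pipeline: it renders model-generated HTML back to an image and scores it against the input screenshot with two of the five metrics (pixel similarity and the structural similarity index measure). The helpers `render_html` and `round_trip_score` are hypothetical names, and `imgkit` (a wrapper around the wkhtmltoimage tool, which must be installed separately) stands in for whatever renderer the benchmark actually uses.

```python
# Minimal round-trip scoring sketch (illustrative; not the Image2Struct code).
# Assumes RGB screenshots and the imgkit + scikit-image packages.
import imgkit
import numpy as np
from skimage.io import imread
from skimage.metrics import structural_similarity
from skimage.transform import resize


def render_html(html: str, out_path: str = "rendered.png") -> np.ndarray:
    """Render an HTML string to a PNG (via wkhtmltoimage) and load it."""
    imgkit.from_string(html, out_path)
    return imread(out_path)


def round_trip_score(input_png: str, generated_html: str) -> dict:
    """Score a model's generated HTML against the original screenshot."""
    target = imread(input_png)
    output = render_html(generated_html)
    # Resize the rendering so both images share a shape before comparison.
    output = resize(output, target.shape, preserve_range=True).astype(target.dtype)
    # Pixel similarity: 1 minus the mean absolute pixel difference, scaled to [0, 1].
    pixel_sim = 1.0 - np.mean(np.abs(target.astype(float) - output.astype(float))) / 255.0
    # Structural similarity index computed over the color channels.
    ssim = structural_similarity(target, output, channel_axis=-1)
    return {"pixel_similarity": float(pixel_sim), "ssim": float(ssim)}
```

Under these assumptions a score near 1 means the re-rendered page is nearly identical to the input; because many different HTML sources can render to the same image, comparing in image space sidesteps the problem that a single task may have multiple valid structures.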

Bibliographic details

Published in: arXiv.org, 2024-10-29
Authors: Josselin Somerville Roberts; Lee, Tony; Chi Heem Wong; Yasunaga, Michihiro; Mai, Yifan; Liang, Percy
Format: Article
Language: English
Subjects: Benchmarks; Domains; Musical scores; Similarity
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Rights: CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/)
Source: Freely Accessible Journals
Online access: Full text