SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM

Vision-extended LLMs have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA. This task aims to test the models' capabilities in identifying entities and providing detailed, entity-specific knowledge. We have developed the SnapNTell Dataset, distinct from traditional VQA datasets: (1) it encompasses a wide range of categorized entities, each represented by images and explicitly named in the answers; (2) it features QA pairs that require extensive knowledge for accurate responses. The dataset is organized into 22 major categories, containing 7,568 unique entities in total. For each entity, we curated 10 illustrative images and crafted 10 knowledge-intensive QA pairs. To address this novel task, we devised a scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score. We will soon make the dataset and the source code publicly accessible.
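As a rough illustration of the retrieval-augmented approach the abstract describes, below is a minimal, hypothetical Python sketch of an entity-centric retrieve-then-answer loop. Every name in it (KNOWLEDGE_BASE, recognize_entity, answer_with_context) is an illustrative placeholder, not the paper's actual code; in the real system the recognizer and answer generator would be a multimodal LLM backed by a large external knowledge source.

# Hypothetical sketch of entity-centric retrieval-augmented VQA.
# The entity recognizer and LLM are stubbed out; only the overall
# retrieve-then-answer control flow mirrors the idea in the abstract.

# Tiny in-memory knowledge base keyed by entity name (illustrative data).
KNOWLEDGE_BASE = {
    "Golden Gate Bridge": "Suspension bridge in San Francisco, opened in 1937.",
}

def recognize_entity(image_path: str) -> str:
    """Placeholder for a visual entity recognizer (e.g. matching the
    query image against a labeled gallery). Returns a fixed entity here."""
    return "Golden Gate Bridge"

def retrieve_knowledge(entity: str) -> str:
    """Look up entity-specific facts; grounding the answer in retrieved
    knowledge is what curbs hallucination on long-tail entities."""
    return KNOWLEDGE_BASE.get(entity, "")

def answer_with_context(question: str, entity: str, context: str) -> str:
    """Placeholder for a multimodal LLM call. A real system would pass
    the image, the question, and the retrieved facts to the model."""
    return f"{entity}: {context}" if context else "Unknown entity."

if __name__ == "__main__":
    image, question = "bridge.jpg", "When did this bridge open?"
    entity = recognize_entity(image)
    context = retrieve_knowledge(entity)
    print(answer_with_context(question, entity, context))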

Bibliographic Details
Main Authors: Qiu, Jielin; Madotto, Andrea; Lin, Zhaojiang; Crook, Paul A; Xu, Yifan Ethan; Dong, Xin Luna; Faloutsos, Christos; Li, Lei; Damavandi, Babak; Moon, Seungwhan
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arXiv.2403.04735
Published: 2024-03-07
Source: arXiv.org