SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM
Vision-extended LLMs have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA. This task tests models' capabilities in identifying entities and providing detailed, entity-specific knowledge. We have developed the SnapNTell Dataset, distinct from traditional VQA datasets in two ways: (1) it encompasses a wide range of categorized entities, each represented by images and explicitly named in the answers; (2) it features QA pairs that require extensive knowledge for accurate responses. The dataset is organized into 22 major categories containing 7,568 unique entities in total. For each entity, we curated 10 illustrative images and crafted 10 knowledge-intensive QA pairs. To address this novel task, we devised a scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score. We will soon make the dataset and the source code publicly accessible.
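The abstract describes a retrieval-augmented multimodal LLM that grounds entity-centric answers in retrieved knowledge rather than relying on the model alone. The following is a minimal sketch of that general idea, not the paper's actual implementation: the knowledge base, the cosine-similarity retriever, and all names (`KNOWLEDGE_BASE`, `retrieve_entity`, `answer`) are illustrative assumptions.

```python
# Hedged sketch of retrieval-augmented entity-centric VQA: match an image
# embedding against a toy entity knowledge base, then compose an answer
# grounded in the retrieved fact. Purely illustrative; not from the paper.
import math

# Toy knowledge base: (entity embedding, entity name, fact snippet).
KNOWLEDGE_BASE = [
    ([0.9, 0.1], "Golden Gate Bridge", "opened in 1937, spans 2,737 m"),
    ([0.1, 0.9], "Eiffel Tower", "completed in 1889, 330 m tall"),
]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_entity(image_embedding):
    """Return (name, fact) of the knowledge-base entity closest to the image."""
    best = max(KNOWLEDGE_BASE, key=lambda row: cosine(row[0], image_embedding))
    return best[1], best[2]

def answer(image_embedding, question):
    """Ground the response in retrieved entity knowledge, so the entity is
    explicitly named and the fact comes from the KB, not free generation."""
    name, fact = retrieve_entity(image_embedding)
    return f"{name}: {fact}"

print(answer([0.8, 0.2], "When was this landmark built?"))
```

In the paper's setting the embedding would come from a vision encoder and the final answer from the multimodal LLM conditioned on the retrieved passage; the sketch only shows why retrieval makes the answer both entity-named and verifiable.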
Saved in:
Main authors: | Qiu, Jielin; Madotto, Andrea; Lin, Zhaojiang; Crook, Paul A; Xu, Yifan Ethan; Dong, Xin Luna; Faloutsos, Christos; Li, Lei; Damavandi, Babak; Moon, Seungwhan |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Qiu, Jielin; Madotto, Andrea; Lin, Zhaojiang; Crook, Paul A; Xu, Yifan Ethan; Dong, Xin Luna; Faloutsos, Christos; Li, Lei; Damavandi, Babak; Moon, Seungwhan |
description | Vision-extended LLMs have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric VQA. This task tests models' capabilities in identifying entities and providing detailed, entity-specific knowledge. We have developed the SnapNTell Dataset, distinct from traditional VQA datasets in two ways: (1) it encompasses a wide range of categorized entities, each represented by images and explicitly named in the answers; (2) it features QA pairs that require extensive knowledge for accurate responses. The dataset is organized into 22 major categories containing 7,568 unique entities in total. For each entity, we curated 10 illustrative images and crafted 10 knowledge-intensive QA pairs. To address this novel task, we devised a scalable, efficient, and transparent retrieval-augmented multimodal LLM. Our approach markedly outperforms existing methods on the SnapNTell dataset, achieving a 66.5% improvement in the BLEURT score. We will soon make the dataset and the source code publicly accessible. |
doi_str_mv | 10.48550/arxiv.2403.04735 |
format | Article |
fullrecord | arXiv record 2403.04735, published 2024-03-07; rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2403.04735 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2403_04735 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T10%3A26%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SnapNTell:%20Enhancing%20Entity-Centric%20Visual%20Question%20Answering%20with%20Retrieval%20Augmented%20Multimodal%20LLM&rft.au=Qiu,%20Jielin&rft.date=2024-03-07&rft_id=info:doi/10.48550/arxiv.2403.04735&rft_dat=%3Carxiv_GOX%3E2403_04735%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |