What Large Language Models Bring to Text-rich VQA?

Text-rich VQA, namely Visual Question Answering based on text recognition in images, is a cross-modal task that requires both image comprehension and text recognition. In this work, we investigate the advantages and bottlenecks of LLM-based approaches to this problem. To address this concern, we separate the vision and language modules: we leverage external OCR models to recognize text in the image and Large Language Models (LLMs) to answer the question given the recognized text. The whole framework is training-free, benefiting from the in-context ability of LLMs. This pipeline achieves superior performance compared to the majority of existing Multimodal Large Language Models (MLLMs) on four text-rich VQA datasets. Based on an ablation study, we further find that the LLM brings stronger comprehension ability and may introduce helpful knowledge for the VQA problem; the bottleneck for LLMs in addressing text-rich VQA problems appears to lie primarily in the visual part. We also combine the OCR module with MLLMs and find that this combination works as well. Notably, not all MLLMs can comprehend the OCR information, which provides insights into how to train an MLLM that preserves the abilities of the LLM.
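
The abstract describes the pipeline only at a high level. As a hedged illustration of how such a training-free OCR-plus-LLM pipeline could be wired together, consider the following minimal sketch: pytesseract stands in for the paper's external OCR models, query_llm is a hypothetical placeholder for whatever LLM endpoint is used, and the prompt format is illustrative, not the authors' exact prompt.

    # Minimal sketch of a training-free OCR + LLM pipeline for text-rich VQA.
    # Assumptions: pytesseract substitutes for the paper's external OCR models;
    # query_llm is a hypothetical stand-in for an actual LLM API call.
    from PIL import Image
    import pytesseract

    def query_llm(prompt: str) -> str:
        """Hypothetical LLM call; plug in any chat/completions client here."""
        raise NotImplementedError("replace with a real LLM API call")

    def ocr_llm_vqa(image_path: str, question: str) -> str:
        # Vision module: recognize the text in the image with an external OCR model.
        ocr_text = pytesseract.image_to_string(Image.open(image_path)).strip()

        # Language module: the LLM answers the question given only the OCR text.
        # No training is involved; the LLM relies on its in-context ability.
        prompt = (
            "The following text was recognized by OCR from an image:\n"
            f"{ocr_text}\n\n"
            f"Question: {question}\n"
            "Answer concisely based on the recognized text."
        )
        return query_llm(prompt)

Because the two modules communicate only through plain text, either side can be swapped independently, which is what makes the ablations in the paper (and the OCR-plus-MLLM combination) straightforward to run.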

Bibliographic details
Main authors: Liu, Xuejing; Tang, Wei; Ni, Xinzhe; Lu, Jinghui; Zhao, Rui; Li, Zechao; Tan, Fei
Format: Article
Language: English
Published: 2023-11-13
Subjects: Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arxiv.2311.07306
Source: arXiv.org
Online access: https://arxiv.org/abs/2311.07306