What Large Language Models Bring to Text-rich VQA?

Text-rich VQA, namely Visual Question Answering based on text recognition in images, is a cross-modal task that requires both image comprehension and text recognition. In this work, we investigate the advantages and bottlenecks of LLM-based approaches to this problem. To address this concern, we separate the vision and language modules: we leverage external OCR models to recognize text in the image and Large Language Models (LLMs) to answer the question given the recognized text. The whole framework is training-free, benefiting from the in-context ability of LLMs. This pipeline achieves superior performance compared to the majority of existing Multimodal Large Language Models (MLLMs) on four text-rich VQA datasets. Based on an ablation study, we further find that the LLM brings stronger comprehension ability and may introduce helpful knowledge for the VQA problem; the bottleneck for LLMs in addressing text-rich VQA problems appears to lie primarily in the visual part. We also combine the OCR module with MLLMs and find that this combination works as well. Notably, not all MLLMs can comprehend the OCR information, which provides insights into how to train an MLLM that preserves the abilities of the LLM.
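
The abstract describes the pipeline only at a high level. As a hedged illustration of how such a training-free OCR-plus-LLM pipeline could be wired together, consider the following minimal sketch: pytesseract stands in for the paper's external OCR models, query_llm is a hypothetical placeholder for whatever LLM endpoint is used, and the prompt format is illustrative, not the authors' exact prompt.

    # Minimal sketch of a training-free OCR + LLM pipeline for text-rich VQA.
    # Assumptions: pytesseract substitutes for the paper's external OCR models;
    # query_llm is a hypothetical stand-in for an actual LLM API call.
    from PIL import Image
    import pytesseract

    def query_llm(prompt: str) -> str:
        """Hypothetical LLM call; plug in any chat/completions client here."""
        raise NotImplementedError("replace with a real LLM API call")

    def ocr_llm_vqa(image_path: str, question: str) -> str:
        # Vision module: recognize the text in the image with an external OCR model.
        ocr_text = pytesseract.image_to_string(Image.open(image_path)).strip()

        # Language module: the LLM answers the question given only the OCR text.
        # No training is involved; the LLM relies on its in-context ability.
        prompt = (
            "The following text was recognized by OCR from an image:\n"
            f"{ocr_text}\n\n"
            f"Question: {question}\n"
            "Answer concisely based on the recognized text."
        )
        return query_llm(prompt)

Because the two modules communicate only through plain text, either side can be swapped independently, which is what makes the ablations in the paper (and the OCR-plus-MLLM combination) straightforward to run.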

Bibliographic details
Main authors: Liu, Xuejing; Tang, Wei; Ni, Xinzhe; Lu, Jinghui; Zhao, Rui; Li, Zechao; Tan, Fei
Format: Article
Language: English
Published: 2023-11-13
Subjects: Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arxiv.2311.07306
Source: arXiv.org
Online access: https://arxiv.org/abs/2311.07306