Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey

The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey on the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.
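As a toy illustration of the token-level attribution techniques this survey covers, the sketch below computes dot-product attention weights of a text query over a handful of "image patch" vectors, treating the normalized weights as a crude per-token contribution score. This is a minimal, self-contained sketch for intuition only; the vectors, names, and the attention-as-attribution reading are illustrative assumptions, not a method from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_attribution(query, keys):
    """Dot-product attention scores as a toy token-level attribution:
    a higher weight means that key token contributed more to the
    query's attended output."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

# Toy example: one text-token query attending over three image-patch vectors.
query = [1.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = attention_attribution(query, patches)
```

Here the first patch, which aligns with the query, receives the largest weight; real MLLM interpretability work extends this idea with gradient-based and perturbation-based attribution across many layers and heads.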

Full description

Saved in:
Bibliographic details
Published in: arXiv.org 2024-12
Main authors: Dang, Yunkai, Huang, Kaichen, Huo, Jiahao, Yan, Yibo, Huang, Sirui, Liu, Dongrui, Gao, Mengxi, Zhang, Jie, Chen, Qian, Wang, Kun, Liu, Yong, Shao, Jing, Xiong, Hui, Hu, Xuming
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Dang, Yunkai
Huang, Kaichen
Huo, Jiahao
Yan, Yibo
Huang, Sirui
Liu, Dongrui
Gao, Mengxi
Zhang, Jie
Chen, Qian
Wang, Kun
Liu, Yong
Shao, Jing
Xiong, Hui
Hu, Xuming
description The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with large language models (LLMs) and computer vision (CV) systems driving advancements in natural language understanding and visual processing, respectively. The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities. Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing transparency, trustworthiness, and reliability in high-stakes applications. This paper provides a comprehensive survey on the interpretability and explainability of MLLMs, proposing a novel framework that categorizes existing research across three perspectives: (I) Data, (II) Model, (III) Training & Inference. We systematically analyze interpretability from token-level to embedding-level representations, assess approaches related to both architecture analysis and design, and explore training and inference strategies that enhance transparency. By comparing various methodologies, we identify their strengths and limitations and propose future research directions to address unresolved challenges in multimodal explainability. This survey offers a foundational resource for advancing interpretability and transparency in MLLMs, guiding researchers and practitioners toward developing more accountable and robust multimodal AI systems.
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-12
issn 2331-8422
language eng
recordid cdi_proquest_journals_3140661939
source Open Access: Freely Accessible Journals by multiple vendors
subjects Artificial intelligence
Computer vision
Inference
Large language models
Task complexity
Visual tasks
title Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey