From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses with respect to multi-modal contents. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectation of the broad public, even though the mos...
Gespeichert in:
Hauptverfasser: | , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Multi-modal Large Language Models (MLLMs) have shown impressive abilities in
generating reasonable responses with respect to multi-modal contents. However,
there is still a wide gap between the performance of recent MLLM-based
applications and the expectation of the broad public, even though the most
powerful OpenAI's GPT-4 and Google's Gemini have been deployed. This paper
strives to enhance understanding of the gap through the lens of a qualitative
study on the generalizability, trustworthiness, and causal reasoning
capabilities of recent proprietary and open-source MLLMs across four
modalities: ie, text, code, image, and video, ultimately aiming to improve the
transparency of MLLMs. We believe these properties are several representative
factors that define the reliability of MLLMs, in supporting various downstream
applications. To be specific, we evaluate the closed-source GPT-4 and Gemini
and 6 open-source LLMs and MLLMs. Overall we evaluate 230 manually designed
cases, where the qualitative results are then summarized into 12 scores (ie, 4
modalities times 3 properties). In total, we uncover 14 empirical findings that
are useful to understand the capabilities and limitations of both proprietary
and open-source MLLMs, towards more reliable downstream multi-modal
applications. |
---|---|
DOI: | 10.48550/arxiv.2401.15071 |